State-of-the-Art Survey

Hannah Bast · Claudius Korzen · Ulrich Meyer · Manuel Penschuck (Eds.)

# LNCS 13201

# **Algorithms for Big Data**

**DFG Priority Program 1736**

# Lecture Notes in Computer Science 13201

# Founding Editors

Gerhard Goos, Karlsruhe Institute of Technology, Karlsruhe, Germany
Juris Hartmanis, Cornell University, Ithaca, NY, USA

# Editorial Board Members

Elisa Bertino, Purdue University, West Lafayette, IN, USA
Wen Gao, Peking University, Beijing, China
Bernhard Steffen, TU Dortmund University, Dortmund, Germany
Moti Yung, Columbia University, New York, NY, USA

More information about this series at https://link.springer.com/bookseries/558

Hannah Bast • Claudius Korzen • Ulrich Meyer • Manuel Penschuck (Eds.)

# Algorithms for Big Data

# DFG Priority Program 1736

Editors

Hannah Bast, University of Freiburg, Freiburg im Breisgau, Germany

Claudius Korzen, University of Freiburg, Freiburg, Germany

Ulrich Meyer, Goethe University Frankfurt, Frankfurt, Germany

Manuel Penschuck, Goethe University Frankfurt, Frankfurt, Germany


ISSN 0302-9743 ISSN 1611-3349 (electronic)
Lecture Notes in Computer Science
ISBN 978-3-031-21533-9 ISBN 978-3-031-21534-6 (eBook)
https://doi.org/10.1007/978-3-031-21534-6

© The Editor(s) (if applicable) and The Author(s) 2022. This book is an open access publication.

Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

# Preface

Computer systems pervade all parts of human activity: transportation systems, energy supply, medicine, the whole financial sector, and modern science have become unthinkable without hardware and software support. As these systems continuously acquire, process, exchange, and store data, we live in a big-data world where information is accumulated at an exponential rate.

The urgent problem has shifted from collecting enough data to coping with its rapid growth and sheer abundance. In particular, data volumes often grow faster than the transistor budget of computers as predicted by Moore's law (i.e., doubling every 18 months). On top of this, we can no longer rely on transistor budgets automatically translating into application performance: the speed of single processing cores has essentially stalled, and exploiting the full memory hierarchy places increasingly complicated requirements on algorithms. As a result, algorithms have to be massively parallel and use memory access patterns with high locality. Furthermore, an x-times machine performance improvement only translates into x-times larger manageable data volumes if we have algorithms that scale nearly linearly with the input size. All of these challenges call for new algorithmic ideas. Last but not least, to have maximum impact, one should not only strive for theoretical results but follow the whole algorithm engineering development cycle, in which theoretical work is complemented by implementation and experimental evaluation.

The "curse" of big data in combination with increasingly complicated hardware has reached all kinds of application areas: genomics research, information retrieval (web search engines, ...), traffic planning, geographical information systems, or communication networks. Unfortunately, most of these communities do not interact in a structured way even though they are often dealing with similar aspects of big-data problems. Frequently, they face poor scale-up behaviour from algorithms that have been designed based on models of computation that are no longer realistic for big data.

# About the SPP 1736

This volume surveys the progress in selected aspects of this important and growing field. It emerged from a research program established by the German Research Foundation (DFG) in 2013 as priority program SPP 1736 on Algorithms for Big Data (https://www.big-data-spp.de), in which researchers from theoretical computer science worked together with application experts to tackle some of the problems discussed above.

The research program was prepared collaboratively by Susanne Albers, Hannah Bast, Kurt Mehlhorn, Ulrich Meyer (coordinator), Eugene Myers, Peter Sanders, Christian Scheideler, and Martin Skutella. The first meetings took place in Frankfurt/Main in 2012. Subsequently, a grant proposal was worked out and submitted to the DFG on October 15, and the program was approved at the spring 2013 meeting of the DFG Senate. The duration of the program was six years, divided into two periods of three years each.

A nationwide call for individual projects attracted over 40 proposals, out of which an international reviewer panel selected 15 funded research projects plus a coordination project (totalling about 20 full PhD student positions) by the end of 2013. Additionally, a few more projects with their own funding were associated with the program in order to benefit from collaboration and joint events (workshops, PhD meetings, summer schools, etc.) organised by the SPP. The members of the priority program produced about 300 publications with more than 8200 citations by May 2022.

# About This Book

The chapters of this volume summarize results of projects realized within the program and survey related work. More than half of them centrally deal with various aspects of algorithms for large and complex networks:


Using instances of a particular type of random graphs discussed in the network generation chapter as a null model, the LA algorithm evaluates the structural similarities between the nodes, and thus differentiates meaningful relationships between nodes from noisy ones. After a detailed discussion of the algorithmic foundations (Chapter 3), the authors present the design of a dedicated hardware accelerator (Chapter 4) for solving the LA problem, which—compared to an Intel cluster—uses 38× less memory and is 1030× more energy efficient.


The topics of the chapters in the second part of this volume range from challenges in scalable cryptography, data streams, and energy-efficient scheduling to generic optimization and text (pre)processing including applications:

– In "Scalable Cryptography" (Chapter 9) Dennis Hofheinz and Eike Kiltz shed light on the quest for cryptographic methods that keep on working for significantly increased data set sizes. The security guarantees of currently used RSA encryption technology, for example, degrade linearly in the number of users and ciphertexts. This limits their applicability to smaller data sets or requires significantly larger keylengths which in turn slows down and complicates the whole process (in particular if the keylengths are to grow dynamically).

The authors discuss a number of settings in which it is possible to provide alternative scalable cryptographic building blocks. In particular, they survey SPP work on the construction of scalable public-key encryption schemes (a central cryptographic building block that helps secure communication), but also briefly mention other settings such as "reconfigurable cryptography".

– In "Distributed Data Streams" (Chapter 10) Jannik Castenow, Björn Feldkord, Jonas Hanselle, Till Knollmann, Manuel Malatyali, and Friedhelm Meyer auf der Heide consider a big data scenario where a server is wirelessly connected to a huge number of sensor nodes that continuously measure data. At each time step the server needs to calculate a function defined over the current measurements of the sensors.

Due to the sensors' restricted compute and battery power, the communication between server and sensors has to be optimized, for example by minimizing the total number of messages using clever randomized protocols. The authors review SPP results for three concrete functions: Top-k-Value Monitoring, Top-k-Position Monitoring, and (Approximate) Count Distinct Monitoring.


The GENO software generates a solver from a specification of an optimization problem, i.e., the objective function and constraints are specified in a formal language. The problem specification is then translated into a general normal form, which in turn is passed on to a general-purpose solver with optimized support for various hardware platforms, including GPUs, through carefully integrated BLAS (Basic Linear Algebra Subroutines) calls. The authors show that, by putting all the components together, the generated solvers are competitive with problem-specific hand-written solvers and orders of magnitude faster than competing approaches that offer comparable ease of use.

– In "Algorithms for Big Data Problems in de Novo Genome Assembly" (Chapter 13) Anand Srivastav, Axel Wedemeyer, Christian Schielke, and Jan Schiemann address some algorithmic problems related to genome assembly.

Concretely speaking, they first present an algorithm which significantly reduces the input data size without practically impacting the assembly quality. They then turn to the important subproblem of efficiently counting k-mers, for which they provide an external-memory solution. Further reconstruction steps boil down to the longest path problem and the Eulerian tour problem. To tackle those, they present a linear-time (per edge) streaming algorithm for heuristically constructing long paths in undirected graphs, and a streaming algorithm for the Euler tour problem with optimal one-pass complexity.

– In "Scalable Text Index Construction" (Chapter 14) Timo Bingmann, Patrick Dinklage, Johannes Fischer, Florian Kurpicz, Enno Ohlebusch, and Peter Sanders discuss the current state of the art in large-scale computation of text-indices.

When treating distributed, external, and shared-memory approaches for different text indices and their applications, the authors point out common techniques that are used in different models of computation or in the computation of different text indices. While most of the discussed work solely focuses on the construction of the text indices, they also show approaches to actually answer queries on text indices in distributed memory. In addition, they discuss real-world applications in bioinformatics and text compression as well as future challenges.

We would like to thank all authors who submitted their work, the referees for their helpful comments, as well as the DFG for accepting and sponsoring the priority program SPP 1736 on Algorithms for Big Data. We hope that this volume will prove useful for further research in big data algorithms.

May 2022

Hannah Bast
Claudius Korzen
Ulrich Meyer
Manuel Penschuck

# Organization

# Reviewers

Susanne Albers, Technische Universität München, Germany
Eugenio Angriman, Humboldt-Universität zu Berlin, Germany
Hannah Bast, Albert-Ludwigs-Universität Freiburg, Germany
Timo Bingmann, Karlsruhe Institute of Technology, Germany
Ulrik Brandes, ETH Zurich, Switzerland
Holger Dell, Goethe University Frankfurt, Germany
Michael Hamann, Karlsruhe Institute of Technology, Germany
Till Knollmann, Paderborn University, Germany
Oliver Koch, University of Münster, Germany
Nils Kriege, University of Vienna, Austria
Florian Kurpicz, Technische Universität Dortmund, Germany
Sebastian Lamm, Karlsruhe Institute of Technology, Germany
Sören Laue, Friedrich-Schiller-Universität Jena, Germany
Ulrich Meyer, Goethe University Frankfurt, Germany
Friedhelm Meyer auf der Heide, Paderborn University, Germany
Henning Meyerhenke, Humboldt-Universität zu Berlin, Germany
Matthias Mnich, Hamburg University of Technology, Germany
Manuel Penschuck, Goethe University Frankfurt, Germany
Knut Reinert, Freie Universität Berlin, Germany
Peter Sanders, Karlsruhe Institute of Technology, Germany
Christian Schindelhauer, Albert-Ludwigs-Universität Freiburg, Germany
Christoph Scholl, Albert-Ludwigs-Universität Freiburg, Germany
Christian Schulz, Heidelberg University, Germany
Anand Srivastav, Christian-Albrechts-Universität Kiel, Germany
Alexander van der Grinten, Humboldt-Universität zu Berlin, Germany
Dorothea Wagner, Karlsruhe Institute of Technology, Germany
Axel Wedemeyer, Christian-Albrechts-Universität Kiel, Germany
Katharina Zweig, TU Kaiserslautern, Germany


# **Algorithms for Large and Complex Networks**

# **Algorithms for Large-Scale Network Analysis and the NetworKit Toolkit**

Eugenio Angriman<sup>1(B)</sup>, Alexander van der Grinten<sup>1</sup>, Michael Hamann<sup>2</sup>, Henning Meyerhenke<sup>1</sup>, and Manuel Penschuck<sup>3</sup>

> <sup>1</sup> Humboldt-Universität zu Berlin, Berlin, Germany ({angrimae,avdgrinten,meyerhenke}@hu-berlin.de)
> <sup>2</sup> Karlsruhe Institute of Technology, Karlsruhe, Germany (michael.hamann@kit.edu)
> <sup>3</sup> Goethe University Frankfurt, Frankfurt am Main, Germany (mpenschuck@ae.cs.uni-frankfurt.de)

**Abstract.** The abundance of massive network data in a plethora of applications makes scalable analysis algorithms and software tools necessary to generate knowledge from such data in reasonable time. Addressing scalability as well as other requirements such as good usability and a rich feature set, the open-source software NETWORKIT has established itself as a popular tool for large-scale network analysis. This chapter provides a brief overview of the contributions to NETWORKIT made by the SPP 1736. Algorithmic contributions in the areas of centrality computations, community detection, and sparsification are the focus, but we also mention several other aspects – such as current software engineering principles of the project and ways to visualize network data within a NETWORKIT-based workflow.

**Keywords:** Network analysis · Algorithms · Software package

# **1 Introduction**

Network phenomena surround us, be they social contact networks, organizational structures, or infrastructure networks such as the energy grid, roads or the (physical) internet. Purely virtual networks such as the world wide web, online social networks, or co-authorship networks can become particularly large and play an ever increasing role in our daily lives [8,62]. Traditional data analysis has been and is very successful in discovering knowledge from non-network (e.g., geometric or relational) data [50]. Yet, networks and their analysis are about "dependence, both between and within variables" [26]. Uncovering implicit dependencies hidden in the data thus requires appropriate algorithmic techniques (some of which are also covered in Leskovec et al.'s textbook on mining massive datasets [50]).

Massive networks, often with billions of vertices and edges, pose challenges to many established analysis concepts and algorithms due to their prohibitive computational costs. This leads to the ongoing development of efficient and scalable algorithms. The open-source software package NETWORKIT<sup>1</sup> [75 SPP] aims to combine a broad

<sup>1</sup> https://networkit.github.io/.

range of such algorithms for the analysis of large networks and to make them accessible via consistent, easy to use, and well-documented frontends. For instance, it offers a feature-rich Python API which integrates into the large Python ecosystem for data analysis. Under the hood, the heavy lifting is carried out by performance-oriented algorithms that are implemented in C++ and often use multicore parallelism. The package is also well suited to develop and evaluate novel algorithmic approaches. As such, NETWORKIT has received numerous unique scalable algorithms and implementations in recent years, particularly designed to handle large inputs.

In this chapter, we present a high-level overview of NETWORKIT (Sect. 2) and portray algorithmic research results derived with and for NETWORKIT – mostly those obtained by projects of SPP 1736. We cover four main topics: centrality algorithms (Sect. 3), community detection (Sect. 4), graph sparsification (Sect. 5) as well as graph drawing and network visualization (Sect. 6). While these have been focus areas of NETWORKIT development during the lifetime of SPP 1736, the package has been used in various other application contexts such as quantum chemistry [56 SPP] and digital humanities [47].

# **2 NetworKit—An Overview**

NETWORKIT has been in development since 2013. The architecture of the current codebase was released in 2014. At the time of writing, NETWORKIT has a regular release cycle with two new major releases per year. Staudt et al. [75 SPP] describe the package's state at the end of 2015. In this section, we consequently focus on the many additions of new functionality as well as improvements to the code quality that have been realized in the meantime. This concerns new performance-oriented graph algorithms, engineering to speed up existing algorithms, more software engineering guidelines and best practices, as well as the modernization and extension of NETWORKIT's integration with other tools within a rich ecosystem (as detailed in Sect. 2.2).

#### **2.1 Design Considerations**

NETWORKIT consists of several Python modules wrapping an independently usable core library that is written in C++. Both parts are connected using Cython and are tightly integrated to offer consistent interfaces for most features. The package is organized into multiple modules, each focusing on one (class of) network analytic problem(s). Important modules deal with network centrality (centrality), community detection (community and scd) as well as graph generation and perturbation (generators and randomization). Some novel algorithms in the centrality, community, and sparsification modules that were developed within SPP 1736 are described in more detail in Sects. 3 to 5. Other important modules that are not covered here include modules for graph algorithms in the language of linear algebra (algebraic, following the philosophy of GraphBLAS [45 SPP]), decomposition of graphs into components (components), distance computations (distance), reading and writing graphs (io), link prediction (linkprediction), graph coarsening (coarsening), and more.
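As an illustration of this module structure, the following sketch uses the Python frontend to generate a graph and run algorithms from several of the modules named above. It assumes NetworKit is installed as the `networkit` package; the parameter values are illustrative, and exact signatures may vary slightly between releases.

```python
import networkit as nk

# Generate a random graph from the generators module (Erdos-Renyi G(n, p)).
G = nk.generators.ErdosRenyiGenerator(10000, 0.001).generate()

# Centrality module: exact betweenness via Brandes' algorithm.
bc = nk.centrality.Betweenness(G)
bc.run()
print(bc.ranking()[:5])                  # top-5 (vertex, score) pairs

# Components and distance computations live in their own modules.
cc = nk.components.ConnectedComponents(G)
cc.run()
print(cc.numberOfComponents())
```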

As a graph data structure, NETWORKIT uses an adjacency array using dynamic arrays (std::vector) to store vertices and their neighborhoods. It also supports edge weights and edge IDs. This data structure was chosen over static ones such as CSR matrices since it allows for efficient dynamic updates. The design is complemented by several non-trivial algorithms that can efficiently update their results if the underlying graph changes (i.e., after adding and/or deleting edges).

Many of NETWORKIT's algorithms use OPENMP for shared-memory parallelism. In fact, several algorithms in NETWORKIT exhibit best-in-class parallel performance [36 SPP]. An empirical comparison [46 SPP] between NETWORKIT and several distributed frameworks for data and network analysis shows that NETWORKIT's speed advantage usually persists even against distributed systems with eight-fold resource consumption. Ref. [46 SPP] finds that a shared-memory machine is sufficient to solve many network analytic problems on real-world instances and concludes that shared-memory parallelism should be preferred to distributed graph algorithms as long as the input graph fits into main memory.
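The following sketch illustrates the dynamic graph interface and the global thread control used by the OpenMP-parallel algorithms; method names follow the Python API, but exact signatures may differ between releases.

```python
import networkit as nk

# The Graph class wraps the adjacency-array representation described above;
# edges can be inserted and removed dynamically.
G = nk.Graph(5, weighted=True)           # 5 isolated vertices
G.addEdge(0, 1, 2.5)
G.addEdge(1, 2)
G.removeEdge(0, 1)
print(G.numberOfNodes(), G.numberOfEdges())

# Many algorithms are parallelized with OpenMP; the number of threads
# can be controlled globally.
nk.setNumberOfThreads(8)
print(nk.getMaxNumberOfThreads())
```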

#### **2.2 Ecosystem**

In recent years, NETWORKIT matured into an actively maintained open-source project with more than 140 000 lines of code and a steadily growing number of users and contributors. By now, the software package exceeds a critical size that warrants efforts beyond the development of new algorithmic features.

To ease contributions and uphold the code quality, NETWORKIT offers detailed guidelines and implements a thorough review process. We also make heavy use of unit tests, static code analysis, and automated code formatting as part of our continuous integration pipeline, which targets the three major operating systems. As new code has to meet these standards, we continuously modernize the codebase. Still, backwards compatibility is a major concern and manifests itself, for instance, in long-term compiler support and in as few API-breaking changes as possible (each preceded by a deprecation period of at least one major version release).

Users benefit from a welcoming community, ever-improving documentation, interactive examples showcasing most features, a regular release schedule, and growing support for package managers (currently brew, Conda, pip, and Spack). NETWORKIT naturally interacts with external projects such as GEPHI (see Sect. 6), SIMEXPAL [4 SPP], and NETWORKX as well as graph repositories and formats including KONECT, SNAP, and METIS; recent changes even make it possible to develop standalone NETWORKIT Python modules.

Graph data can not only be imported but also be synthesized. To this end, NETWORKIT offers versatile graph generators in the modules generators and randomization. Among others, they are designed to generate and supplement datasets for applications ranging from rapid prototyping to experimental campaigns. Here, we only mention the supported network *models* since Chap. 2 surveys novel generation *algorithms* obtained during SPP 1736. We include here citations to models or generators developed for/with NETWORKIT.

– Focus on community structure: Clustered-Random-Graph, LFR, PubWeb, R-MAT, Stochastic Block Model, Watts-Strogatz


Several generators have dynamic variants simulating the evolution of graphs over time.

# **3 Centrality Algorithms**

One of the most popular concepts used for the analysis of a graph *G* = (*V*,*E*) is *centrality*. Centrality measures assign a score to each vertex<sup>2</sup> (or group of vertices) based on its structural position or importance; these scores allow a corresponding vertex ranking [21]. As an example, the well-known PageRank [27] is a centrality measure originally devised for web page (and eventually search query) ranking. It is important to match the underlying research question with the appropriate centrality measure [77 SPP], and no single measure is universal. Thus, dozens of measures have been proposed in the literature [21].

As described in more detail below, the centrality research within NETWORKIT revolves not only around faster algorithms for computing individual scores and top-*k* rankings. Another emphasis is placed on two families of centrality-driven optimization problems (centrality improvement and group centrality) and how to scale approximation algorithms or heuristics for their solution to much larger input sizes. For a broader overview, also with a scalability focus, the reader is referred to Ref. [35 SPP].

It should also be noted that fast centrality algorithms can be useful in different (but related) contexts as well; e.g., scores of several centrality measures are used as shortcuts for more expensive influence maximization calculations [70 SPP]. Also, using score distributions for graph fingerprinting (putting graphs into classes where all members have similar distributions) is a conceivable use case with the need for numerous measures that can be computed quickly.

#### **3.1 Individual Centrality Scores**

We first discuss centrality measures for individual vertices, i.e., measures that assign a centrality score to each *v* ∈ *V*. During SPP 1736, our focus has been on two classes of centrality measures: centralities that make use of shortest path computations (i.e., (harmonic) closeness and betweenness) and algebraic centrality measures that consider more than just shortest paths (like Katz centrality and electrical closeness). Figure 1 depicts the distribution of these centralities for a single network, including the ED Walk centrality that we propose in Ref. [3 SPP].

<sup>2</sup> Edge centrality measures are ignored here in the interest of space.

**Fig. 1.** Histograms of the distribution of vertex centrality measures of the JAZZ network, which models the collaboration of Jazz musicians [34].

**Betweenness.** Betweenness centrality is based on the fraction of shortest paths a vertex participates in. NETWORKIT implements the well-known Brandes algorithm [23] for exact betweenness and several algorithms for betweenness approximation. For static graphs, it has an implementation of the KADABRA algorithm [22]; additionally, NETWORKIT can approximate betweenness in dynamic graphs [15 SPP]. Both of these algorithms employ a sampling technique that was originally introduced by Riondato and Kornaropoulos [66]. More precisely, the algorithms sample pairs (*s*,*t*) of source and target vertices uniformly at random. For each (*s*,*t*), a single shortest path is sampled uniformly at random out of all shortest *s*-*t* paths. The algorithms count the number of occurrences of vertices on these paths; they differ in their stopping conditions. The multi-threaded implementation of the static KADABRA algorithm additionally exploits a fast data structure for asynchronous synchronization barriers [36 SPP]. To the best of our knowledge, NETWORKIT's implementation of KADABRA is the fastest betweenness approximation code that is available for multi-threaded machines.
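A minimal sketch of betweenness approximation through the Python API; the class name `KadabraBetweenness` and its error/probability parameters reflect recent NetworKit releases and may differ in older versions, and the input file name is hypothetical.

```python
import networkit as nk

G = nk.readGraph("graph.metis", nk.Format.METIS)   # hypothetical input file

# KADABRA: per-vertex absolute error at most 0.01 with probability >= 0.9.
kadabra = nk.centrality.KadabraBetweenness(G, 0.01, 0.1)
kadabra.run()
print(kadabra.ranking()[:10])                      # approximate top-10 vertices
```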

In Ref. [39 SPP], this algorithm was extended to work with replicated graphs in distributed memory. The resulting algorithm obtains good parallel speedups and performs well even on multi-socket shared memory machines due to the fact that it can avoid NUMA bottlenecks. Since distributed memory algorithms are outside the scope of NETWORKIT, this implementation is available externally.

**Closeness.** Closeness centrality also uses the notion of shortest paths: it quantifies the importance of a vertex *v* ∈ *V* depending on how close *v* is to all the other vertices of the graph [11]. It is defined as *c*(*v*) := (*n* − 1)/∑<sub>*w* ∈ *V*</sub> *d*(*v*,*w*), and computing it for a single vertex requires running a single-source shortest path (SSSP) algorithm. The textbook algorithm to identify the top-*k* vertices with highest closeness centrality computes *c*(*v*) for each vertex of the graph by running *n* SSSPs, which is impractical for large-scale networks. NETWORKIT improves on this by providing an algorithm which finds the top-*k* vertices with highest closeness centrality along with their exact value of *c*(·) [12 SPP]. Even though the worst-case running time of the algorithm is also Ω(|*V*||*E*|), experimental evaluation on real-world data shows that, for small values of *k*, the algorithm is in practice much more efficient than the textbook algorithm and other state-of-the-art strategies.

NETWORKIT additionally implements a batch-dynamic version of this algorithm [18 SPP,2 SPP], which also addresses harmonic centrality [21,67] – an alternative definition of closeness centrality introducing support for disconnected graphs. Experiments on both real-world and synthetic instances demonstrate that, for moderately large batches of edge updates, the dynamic algorithm is up to four orders of magnitude faster than a static recomputation from scratch.
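A sketch of the top-*k* closeness interface (and its harmonic variant for disconnected graphs); the class and method names follow the centrality module of recent releases, and the chosen *k* and graph parameters are arbitrary.

```python
import networkit as nk

G = nk.generators.ErdosRenyiGenerator(100000, 0.0001).generate()

# Exact top-k closeness without evaluating c(v) for every vertex [12 SPP].
topc = nk.centrality.TopCloseness(G, 10)
topc.run()
print(topc.topkNodesList(), topc.topkScoresList())

# Harmonic closeness is also defined on disconnected graphs.
toph = nk.centrality.TopHarmonicCloseness(G, 10)
toph.run()
print(toph.topkNodesList())
```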

**Electrical Closeness.** Electrical resistance is a distance function on graphs that is constructed by interpreting the graph as a network of electrical resistors and by measuring the effective resistance between vertices in this network. If the usual distance function (based on shortest-path distances) in the definition of closeness is replaced by effective resistance, one obtains the definition of *electrical* closeness. This centrality measure has been gaining attention due to the fact that it considers paths of any length. NETWORKIT has an efficient approximation algorithm to compute electrical closeness [6 SPP]. This algorithm exploits a well-known connection between electrical networks and uniform spanning trees to approximate electrical closeness faster than previous numerical algorithms (including the numerical algorithm from Ref. [17 SPP]) and can handle graphs with hundreds of millions of edges.

As part of our work on electrical closeness, NETWORKIT gained support for various numerical algorithms. These are typically either used as subprocedures of our algorithms or for performance and/or quality comparisons; however, they can also be called as standalone numerical solvers. Experiments with an (in terms of theoretical analysis) fast Laplacian solver revealed severe limitations in practice [43 SPP] – which is why it was discarded. Instead, we included a fast implementation [17 SPP] of the lean algebraic multigrid algorithm (LAMG) [51], which is particularly well-suited to solve series of Laplacian linear systems with identical system matrices.
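A hedged sketch of the electrical-closeness approximation via the Python API; the class name `ApproxElectricalCloseness` and its error parameter are assumptions based on recent releases, and the input file is hypothetical.

```python
import networkit as nk

G = nk.readGraph("graph.metis", nk.Format.METIS)   # hypothetical input file

# UST-based approximation of electrical closeness [6 SPP]; the second
# argument bounds the absolute approximation error (name may vary by release).
aec = nk.centrality.ApproxElectricalCloseness(G, 0.1)
aec.run()
print(aec.ranking()[:10])
```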

**Katz Centrality.** NETWORKIT also implements an approximation algorithm for Katz centrality that can handle graphs with billions of edges within a few minutes [38 SPP]. The algorithm utilizes lower and upper bounds on the centrality score of each vertex and improves these bounds until the Katz centrality ranking is computed with sufficient precision. In comparison to earlier combinatorial algorithms for Katz centrality, our algorithm is the first to provide a provable approximation bound and/or to guarantee the correctness of the ranking. It is also at least 50% faster than numerical methods. NETWORKIT provides a parallel implementation of this algorithm that can also handle dynamic graphs. In Ref. [38 SPP], we additionally provide a GPU-based implementation which is not part of NETWORKIT.
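The sketch below computes Katz scores through the Python API. `KatzCentrality` is the generic entry point of the centrality module; whether a given release uses the bound-based approximation from [38 SPP] under the hood is not guaranteed, the attenuation factor is illustrative, and the input file is hypothetical.

```python
import networkit as nk

G = nk.readGraph("graph.metis", nk.Format.METIS)   # hypothetical input file

# Katz centrality: walks of length l contribute with weight alpha^l.
katz = nk.centrality.KatzCentrality(G, 1e-3)
katz.run()
print(katz.ranking()[:10])
```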

#### **3.2 Improving One's Own Centrality**

One possible way to improve one's ranking position in a web search is to attract links from influential web pages. For some time, this led to so-called link farming [49] for search engine optimization. More generally, beyond web search, one wants to increase the centrality of a vertex by adding a specified number of new edges incident to it. Crescenzi et al. [30] addressed this problem for closeness centrality. As a follow-up to that work, Ref. [13 SPP] considered two betweenness centrality improvement problems: maximizing the betweenness *score* of a given vertex (MBI) and maximizing the *ranking position* of a given vertex (MRI). The paper proves that both problems are hard to approximate. Unless *P* = *NP*, MBI cannot be approximated within a factor greater than 1 − 1/(2*e*), and for MRI there is no α-approximation algorithm for any constant α ≤ 1. The paper also proposes a simple greedy algorithm for MBI that performs well in practice and provides a (1−1/*e*)-approximation. This way, MBI can be approximated for (most) networks with up to 10<sup>5</sup> edges in a matter of seconds or a few minutes. The greedy algorithm's implementation builds, among others, upon a dynamic algorithm for betweenness centrality [16 SPP] that can update the betweenness scores of all vertices much faster after small graph changes (such as the insertion of one or few edges).

#### **3.3 Group Centrality Optimization**

*Group centralities* are network-analytic measures that quantify the importance of vertex groups [31]. In contrast to centrality measures that apply to individual vertices, the goal of these measures is to determine how well the entire group jointly "covers" the graph; i.e., the group centrality score is *not* determined by the scores of individual vertices.

NETWORKIT includes various group centrality algorithms to approximate sets of vertices that maximize the group centrality score. Most of the algorithms are based on submodular optimization. For example, NETWORKIT implements a greedy algorithm to approximate group degree and the group betweenness maximization algorithm by Mahmoody et al. [57]. New algorithms developed as part of SPP 1736 are the GED-Walk approximation algorithms from Ref. [3 SPP] and various group closeness algorithms; these algorithms are described below. A very recent addition to NETWORKIT is an approximation algorithm for group forest closeness centrality; for details we refer to Ref. [37 SPP].

**Group Closeness.** Group closeness measures the importance of a *group* of vertices *S* ⊂ *V* as the reciprocal of the sum of the distances from *S* to the vertices in *V* \ *S*, where the distance from *S* to a vertex *v* ∈ *V* is defined as the minimum *d*(*S*,*v*) := min<sub>*u* ∈ *S*</sub> *d*(*u*,*v*). Finding the group *S* with highest group closeness is known to be an *NP*-hard optimization problem [29,1 SPP]. Thus, in practice, the problem is addressed on large-scale networks either with heuristics or approximation algorithms. NETWORKIT provides a greedy heuristic [14 SPP] that computes a set of vertices with high group centrality. On small enough instances where it is feasible to compute the optimum, it has been shown that the algorithm yields solutions with nearly optimal quality.

An alternative heuristic, which allows to trade quality for speed, is based on local search. NETWORKIT implements a family of local search heuristics for group closeness maximization that achieve different trade-offs between quality and running time [5 SPP]. In general, they are one to three orders of magnitude faster than the greedy algorithm. At the same time, our algorithms retain 80%—and, in numerous cases, even more than 99%—of the greedy algorithm's solution quality. NETWORKIT also includes the first approximation algorithm for group closeness maximization [1 SPP] (for undirected graphs) which yields solutions with higher quality than the greedy algorithm at the cost of additional running time.

A major limitation of group closeness is that it can only handle (strongly) connected graphs – the distance between unreachable vertices is either undefined or infinite, and an infinite denominator results in a group closeness score of zero. Another group centrality measure that also handles disconnected graphs is group harmonic centrality, which is defined as *GH*(*S*) := ∑<sub>*u* ∈ *V* \ *S*</sub> *d*(*S*,*u*)<sup>−1</sup>. Maximizing *GH* has been shown to be an *NP*-hard problem [1 SPP] as well, and two approximation algorithms for group harmonic maximization have been introduced in Ref. [1 SPP]; both of them are available in NETWORKIT.

**GED-Walk.** GED-Walk (GED = group exponentially decaying) is an algebraic group centrality measure that was introduced in Ref. [3 SPP]. Similarly to Katz centrality (which only applies to individual vertices), GED-Walk counts the number of *walks* (and not paths) in the graph. Unlike Katz centrality, it counts walks that *cross* the group of vertices (instead of counting walks that *start* (or end) at certain vertices). Computing GED scores can essentially be done via sparse matrix-vector multiplication; hence, the measure can be computed faster than centrality measures that involve the computation of shortest paths. In Ref. [3 SPP], we propose a greedy algorithm that computes a group with approximately maximal GED-Walk centrality. The algorithmic approach is based on techniques derived from our Katz algorithm [38 SPP] and iteratively refines bounds on the group centrality score. In experiments, GED-Walk maximization turns out to be at least one order of magnitude faster than the corresponding greedy algorithms for group betweenness and group closeness. When applied within semi-supervised vertex classification, GED-Walk improves the accuracy compared to various existing measures.
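Both group measures are exposed in the centrality module; the following is a sketch under the assumption that the class and method names below (`GroupCloseness`, `GedWalk`, `groupMaxCloseness`, `groupMaxGedWalk`) match the installed release, with an arbitrary group size and a hypothetical input file.

```python
import networkit as nk

G = nk.readGraph("graph.metis", nk.Format.METIS)   # hypothetical input file

# Greedy heuristic for group closeness [14 SPP].
gc = nk.centrality.GroupCloseness(G, 10)
gc.run()
print(gc.groupMaxCloseness())

# Approximate GED-Walk maximization [3 SPP].
ged = nk.centrality.GedWalk(G, 10)
ged.run()
print(ged.groupMaxGedWalk(), ged.getApproximateScore())
```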

# **4 Community Detection**

Community detection aims to detect subgraphs that are internally densely and externally sparsely connected. From this fuzzy idea, many formalizations and algorithms have been developed [32]. A division of the graph into disjoint communities is the most frequently studied setting. The most popular quality measure for this setting is modularity [63]. As it is *NP*-hard to find the (clustering with) optimal modularity score [24], heuristics are used in practice. A very popular one is the Louvain algorithm [20]. While it is already quite fast, it is purely sequential in its original formulation and thus does not exploit the many cores available in modern processors. Already the earliest work in NETWORKIT includes the development of a parallel variant of the Louvain algorithm named PLM [72]. This first work also includes a fast parallel label propagation algorithm named PLP and an ensemble algorithm that combines several runs of PLP with a final step where PLM is used. Later improvements to PLM, including the parallelization of additional steps, made PLM so fast that it outperformed the ensemble approach both in terms of speed and quality [74 SPP]. Further, a refinement round similar to Ref. [68] has been introduced that further increases the quality at the expense of a slightly longer running time. PLM was later used in a case study on correspondences between clusterings [33 SPP]. With such correspondences one can reveal how one clustering differs from another one, e.g., when computed with different algorithms or after minor graph changes.

If only a community around a specific vertex or a set of vertices (so-called seed vertices) is desired, we do not need to detect communities that cover the whole graph. Many such algorithms greedily add new vertices until a local minimum of a certain quality function is reached. A first study on such local community detection algorithms [71 SPP] based on NETWORKIT has shown that they are quite slow and imprecise in comparison to PLM. A more recent study [41 SPP] shows that many local community algorithms detect a community in which the seed is not strongly connected. Only algorithms that employ further guidance, e.g., using edge scores based on triangles, are able to correctly identify a community the seed vertex is embedded in. The study further shows that the results of all local community detection algorithms can be improved by starting with the largest clique in the subgraph induced by the neighbors of the seed vertex. For this, the possibility to combine two local community detection algorithms has been added to NETWORKIT – a first one that detects the clique and then a second one that expands this clique into a community [41 SPP]. This allows changing both the seeding strategy and the latter expansion step.

For the experimental evaluation of community detection algorithms, suitable input instances are required [7]. Ideally, instances from applications of community detection with known ground truth communities should be used for this. However, they are frequently either quite small, unavailable due to privacy concerns or commercial interests, or the available ground truth data cannot be recovered from the graph's structure [32,65]. For this reason, synthetically generated benchmark graphs with generated ground truth communities are frequently used. The most popular one is the LFR benchmark graph generator [48], of which NETWORKIT also provides an implementation for the case of unweighted, undirected graphs with disjoint communities [73 SPP] (see also Chap. 2). Due to a partial parallelization and more efficient data structures, experiments show a speedup compared to the original implementation of 18 to 70 using 16 cores [73 SPP]. When the similarity between a detected and a (possible) ground truth community is low, it is often not clear if such a similarity could also be achieved by chance. Therefore, Hamann et al. [41 SPP] also introduced a simple baseline algorithm using a BFS that stops when the same number of vertices as contained in the ground truth community have been visited and returns them as community. Together with additional methods for the evaluation of the found communities, NETWORKIT thus provides a comprehensive framework for the development, evaluation, and application of local community detection algorithms.
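The following sketch combines the LFR generator with PLM as a typical evaluation setup; the generator's setter names and the refine flag follow the Python API of recent releases, and all parameter values are illustrative.

```python
import networkit as nk

# LFR benchmark graph with built-in ground-truth communities [73 SPP].
gen = nk.generators.LFRGenerator(10000)
gen.generatePowerlawDegreeSequence(20, 50, -2)          # avg degree, max degree, exponent
gen.generatePowerlawCommunitySizeSequence(10, 100, -1)  # min size, max size, exponent
gen.setMu(0.3)                                          # mixing parameter
G = gen.generate()
truth = gen.getPartition()

# Parallel Louvain (PLM) with the optional refinement round [74 SPP].
detected = nk.community.detectCommunities(G, algo=nk.community.PLM(G, refine=True))
print(nk.community.Modularity().getQuality(detected, G))
```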

Nastos and Gao [61] suggest quasi-threshold graphs, i.e., graphs that do not contain a path or cycle of four vertices as vertex-induced subgraph, as a model for communities in social networks. As a given graph is usually not a quasi-threshold graph, they suggest to insert and delete as few edges as possible to transform a graph into a quasi-threshold graph. The connected components are then considered as communities. The first scalable heuristic for this problem [25 SPP] has been implemented in NETWORKIT, for details we refer to Chap. 7.

# **5 Graph Sparsification**

Centrality measures suggest that certain vertices or edges are more important than others. In graph *sparsification*, the idea is to exploit this fact to obtain a subset of the vertices and/or edges that preserves key properties of the graph, i.e., to select vertices and edges that are important for these properties. Properties of the graph can be preserved either directly or in a scaled version. For example, the degree distribution cannot be exactly preserved when we remove edges, but we can preserve the general shape of the degree distribution. Graph sparsification can provide insights into the structure of a graph, as it reveals how much redundancy there is and which edges are important for certain properties. An application of these insights is speeding up other network analysis tasks or making them possible in the first place by reducing the graph's size such that the running time and memory requirements are reduced [69]. Further, some of these sparsification techniques can also remove noise from the graph such that, e.g., more informative drawings can be generated [64 SPP]. In NETWORKIT, we provide a set of edge sparsification algorithms [40 SPP]. Given a graph *G* = (*V*,*E*), they identify subsets *E*′ ⊂ *E* of the edges such that *G*′ = (*V*,*E*′) preserves certain properties of *G*. We currently do not consider vertex sparsification, i.e., filtering vertices while maintaining properties of the graph – since in many network analysis tasks (like vertex centralities or community detection), we are interested in a result for every vertex. If some vertices were no longer part of the graph, we would need to extrapolate their results, requiring an additional post-processing step for every network analysis task.

With its diverse set of network analysis algorithms, NETWORKIT provides the ideal testbed for sparsification algorithms. A study [40 SPP] compares a set of six existing and one novel sparsification algorithm as well as five novel variants of the existing algorithms using NETWORKIT. The study shows that these sparsification algorithms can be classified into three groups: those that primarily preserve edges within densely connected areas, those that primarily preserve connectivity between different areas, and those that are almost or completely random. The algorithms in the first group strengthen the formation of communities and either keep or increase the average local clustering coefficient as already suggested by previous work [69,64 SPP]. The novel local degree technique, on the other hand, keeps distances in the graph and thus the diameter small, see Fig. 2 for an example. As the results show, it is also good at preserving vertex centralities. Completely random filtering also works surprisingly well at preserving a wide range of network properties. The study shows that all methods perform better for most measures if, instead of directly filtering edges globally, a vertex of degree *d* keeps its top *d<sup>e</sup>* neighbors for some exponent *e* < 1. This local filtering step has been proposed before [69] for a single sparsification algorithm and the study suggests to apply it to all considered algorithms. In particular, this preserves connectivity of the graph quite well and in general leads to a more even distribution of the preserved edges.
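A sketch of the sparsification interface: the `LocalDegreeSparsifier` class and `getSparsifiedGraphOfSize` follow the sparsification module of recent releases, the input file is hypothetical, and the 15% edge ratio mirrors Fig. 2.

```python
import networkit as nk

G = nk.readGraph("jazz.graph", nk.Format.METIS)   # hypothetical input file
G.indexEdges()                                    # sparsifiers require edge IDs

# Keep roughly 15% of the edges using the local degree technique [40 SPP].
sparsifier = nk.sparsification.LocalDegreeSparsifier()
G15 = sparsifier.getSparsifiedGraphOfSize(G, 0.15)
print(G.numberOfEdges(), G15.numberOfEdges())

# The underlying edge scores can be reused, e.g., as an edge centrality measure.
scores = sparsifier.scores(G)
```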

All of these sparsification algorithms can be decomposed into two steps: a first step that assigns each edge a score and a second step that only keeps a certain fraction of the highest-rated edges. Even the local filtering step can be implemented as a transformation of edge scores.

**Fig. 2.** Drawing using GEPHI [9] of the JAZZ network [34] (left) and a sparsified version containing 15% of the edges (right) using the novel local degree algorithm. Vertex size and color are proportional to degree. (Color figure online)

This makes it possible to easily combine existing and new algorithms. Further, the resulting scores can be considered as edge centrality measures that permit a ranking of the edges. With the help of visualization software like GEPHI [9] (Sect. 6), the scores can also be visualized or used for interactive filtering of edges.

# **6 Graph Drawing and Network Data Visualization**

In exploratory network analysis, one needs to evaluate several properties of the network, which requires writing code to run algorithms and plot their results. To speed up this process, NETWORKIT provides a dedicated profiling module that allows non-expert users to run several network analysis algorithms as a single program and visualize their results in a graphical report that can be rendered in a Jupyter Notebook or exported as an HTML or a LaTeX document. As thoroughly explained in Ref. [75 SPP], the report first lists global properties of the network such as the size and the density. Then it provides an overview of the distribution of several centrality measures as histograms (as shown in Fig. 1, Sect. 3), followed by a more detailed statistical analysis. Finally, the report includes a matrix with the Spearman correlation coefficients between the rankings of the vertices according to the considered centrality measures; an example for the JAZZ network is shown in Fig. 3.
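Generating such a report takes a few lines; the `Profile` class lives in the profiling module, while the output options shown here are assumptions that may differ between releases, and the input file is hypothetical.

```python
import networkit as nk

G = nk.readGraph("jazz.graph", nk.Format.METIS)   # hypothetical input file

# Build the graphical report described above; in a Jupyter notebook pf.show()
# renders it inline, output() exports it to a file.
pf = nk.profiling.Profile.create(G)
pf.output("HTML", ".")
```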

When dealing with large graphs, statistical overviews such as the ones mentioned above are indispensable, since the well-known vertex-edge diagrams do not even scale to graphs of medium size (without further adjustments). For small graphs, however, visualizations such as those diagrams can be very valuable. In general, the goal of graph visualization [10] is to represent graphs in a form that is meaningful to the human eye.


**Fig. 3.** Spearman's correlation coefficients between vertex rankings obtained with different centrality measures for the JAZZ network. Darker (lighter) block shades indicate higher (lower) correlation values.

**Fig. 4.** Visualization example with GEPHI of the KARATE graph. Red vertices have the highest harmonic centrality, blue vertices the lowest. (Color figure online)

Popular application areas for graph visualization are biology (e.g., genetic maps), chemistry (e.g., protein functions) [42], social network analysis [47], and many more. GEPHI [9] is a popular Java-based GUI application to explore and visualize graphs. NETWORKIT's gephi module [40 SPP] makes it possible to use GEPHI to visualize graphs along with additional vertex or edge attributes with minimal effort. Figure 4 shows the visualization in GEPHI of the popular KARATE graph obtained with the ForceAtlas2 graph drawing algorithm [44] and by coloring the vertices according to their harmonic centrality score.
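A hedged sketch of streaming a graph and per-vertex scores into a running GEPHI instance; it requires GEPHI's graph-streaming plugin, and the `GephiStreamingClient` class and its export method names are assumptions about the gephi module that may not match every release. The input file is hypothetical.

```python
import networkit as nk

G = nk.readGraph("karate.graph", nk.Format.METIS)   # hypothetical input file
harmonic = nk.centrality.HarmonicCloseness(G)
harmonic.run()

# Push the graph and the per-vertex scores to GEPHI's default workspace
# (streaming plugin required; class/method names are assumptions).
client = nk.gephi.streaming.GephiStreamingClient()
client.exportGraph(G)
client.exportNodeValues(G, harmonic.scores(), "harmonic")
```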

Graph drawing actually precedes visualization in most cases. It is the process of computing meaningful coordinates for the graph vertices where such information is not supplied with the graph. NETWORKIT's approach for the most part is to use the graph drawing capability in GEPHI. However, it also has an implementation of an algorithm for the maxent-stress objective function, following Ref. [58 SPP]. Here, the main intention is to solve an optimization problem that computes the three-dimensional structure of biomolecules, given distance information between some atom pairs. To this end, the original algorithm received several application-specific adaptations [76 SPP], e.g., to be able to handle noisy data appropriately. As a result, the new algorithm by far outperforms its competitors in terms of speed and flexibility, and often even produces superior solution quality.

# **7 Conclusions**

The main design goals of NETWORKIT (speed, rich feature set, usability, and integration into an ecosystem) prove to be very useful for users, but they can also be challenging for the developers. One lesson learned for keeping an academic open-source project of this size manageable and alive is to combine best practices in both software engineering and algorithm engineering [4 SPP]. For example, a proper modularization allows easier reuse and combination of components, leading to better extensibility and maintainability. These keywords are well-known in software engineering, but they also have their effect in algorithm design and implementation – in particular a simplified exploration of the design space in experimental algorithmics. NETWORKIT has already proved to be very useful in this respect for developers.

We have seen that approximation and parallelism can bring us a long way regarding scalability. They are the obvious, but certainly not the only choices for acceleration: exploiting the structure of the data, e.g., small vs. large diameter [12 SPP], can yield significant speedups on real-world data—even in the context of exact computations and potentially on top of parallelism.

NETWORKIT is constantly improved and extended – according to the resources available to the project. There are numerous ideas for larger updates from various angles – of which we mention only two representative ones: inherent support for attributes within (some of) the algorithms and further/improved integration with other tools. The latter is particularly geared towards a closer connection with machine learning, both on an algorithmic and a software tool level. Given the current interest in machine learning for data analysis, complete workflows within one seamless toolchain including NETWORKIT and tools such as SCIKIT-LEARN can be expected to be very attractive for users from many domains.

# **References**

- 20. Blondel, V.D., Guillaume, J., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech: Theory Exp. **2008**(10), P10008 (2008). https://doi.org/10.1088/1742-5468/2008/10/p10008
- 21. Boldi, P., Vigna, S.: Axioms for centrality. Internet Math. **10**(3–4), 222–262 (2014). https://doi.org/10.1080/15427951.2013.865686
- 22. Borassi, M., Natale, E.: KADABRA is an adaptive algorithm for betweenness via random approximation. ACM J. Exp. Algorithmics **24**(1), 1.2:1–1.2:35 (2019). https://doi.org/10.1145/3284359
- 23. Brandes, U.: A faster algorithm for betweenness centrality. J. Math. Sociol. **25**(2), 163–177 (2001). https://doi.org/10.1080/0022250X.2001.9990249
- 24. Brandes, U., et al.: On modularity clustering. IEEE Trans. Knowl. Data Eng. **20**(2), 172–188 (2008). https://doi.org/10.1109/TKDE.2007.190689
- 29. Chen, C., Wang, W., Wang, X.: Efficient maximum closeness centrality group identification. In: Cheema, M.A., Zhang, W., Chang, L. (eds.) ADC 2016. LNCS, vol. 9877, pp. 43–55. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46922-5\_4
- 30. Crescenzi, P., D'Angelo, G., Severini, L., Velaj, Y.: Greedily improving our own closeness centrality in a network. ACM Trans. Knowl. Discov. Data **11**(1), 9:1–9:32 (2016). https://doi.org/10.1145/2953882
- 31. Everett, M.G., Borgatti, S.P.: The centrality of groups and classes. J. Math. Sociol. **23**(3), 181–201 (1999). https://doi.org/10.1080/0022250X.1999.9990219
- 32. Fortunato, S., Hric, D.: Community detection in networks: a user guide. Phys. Rep. **659**, 1–44 (2016). https://doi.org/10.1016/j.physrep.2016.09.002
- 34. Gleiser, P.M., Danon, L.: Community structure in jazz. Adv. Complex Syst. **6**(4), 565–574 (2003). https://doi.org/10.1142/S0219525903001067
- 42. Herman, I., Melançon, G., Marshall, M.S.: Graph visualization and navigation in information visualization: a survey. IEEE Trans. Vis. Comput. Graph. **6**(1), 24–43 (2000). https://doi.org/10.1109/2945.841119
- 47. Kreutel, J.: Augmenting network analysis with linked data for humanities research. In: Kremers, H. (ed.) Digital Cultural Heritage, pp. 1–14. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-15200-0\_1
- 48. Lancichinetti, A., Fortunato, S., Radicchi, F.: Benchmark graphs for testing community detection algorithms. Phys. Rev. E **78**, 046110 (2008). https://doi.org/10.1103/PhysRevE.78.046110
- 49. Langville, A.N., Meyer, C.D.: Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton (2006)
- 50. Leskovec, J., Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets, 2nd edn. Cambridge University Press, Cambridge (2014)
- 51. Livne, O.E., Brandt, A.: Lean algebraic multigrid (LAMG): fast graph Laplacian linear solver. SIAM J. Sci. Comput. **34**(4), B499–B522 (2012). https://doi.org/10.1137/110843563
- 57. Mahmoody, A., Tsourakakis, C.E., Upfal, E.: Scalable betweenness centrality maximization via sampling. In: KDD, pp. 1765–1773. ACM (2016). https://doi.org/10.1145/2939672.2939869
- 59. Mocnik, F.B.: The polynomial volume law of complex networks in the context of local and global optimization. Sci. Rep. **8**(1), 1–10 (2018). https://doi.org/10.1038/s41598-018-29131-0
- 60. Mocnik, F.-B., Frank, A.U.: Modelling spatial structures. In: Fabrikant, S.I., Raubal, M., Bertolotto, M., Davies, C., Freundschuh, S., Bell, S. (eds.) COSIT 2015. LNCS, vol. 9368, pp. 44–64. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23374-1\_3
- 65. Peel, L., Larremore, D.B., Clauset, A.: The ground truth about metadata and community detection in networks. Sci. Adv. **3**(5), e1602548 (2017). https://doi.org/10.1126/sciadv.1602548
- 66. Riondato, M., Kornaropoulos, E.M.: Fast approximation of betweenness centrality through sampling. In: WSDM, pp. 413–422. ACM (2014). https://doi.org/10.1145/2556195.2556224
- 67. Rochat, Y.: Closeness centrality extended to unconnected graphs: the harmonic centrality index. In: ASNA, Applications of Social Network Analysis (2009)
- 68. Rotta, R., Noack, A.: Multilevel local search algorithms for modularity clustering. ACM J. Exp. Algorithmics **16**, 27 (2011). https://doi.org/10.1145/1963190.1970376
- 69. Satuluri, V., Parthasarathy, S., Ruan, Y.: Local graph sparsification for scalable clustering. In: SIGMOD Conference, pp. 721–732. ACM (2011). https://doi.org/10.1145/1989323.1989399
- 72. Staudt, C., Meyerhenke, H.: Engineering high-performance community detection heuristics for massive graphs. In: ICPP, pp. 180–189. IEEE Computer Society (2013). https://doi.org/10.1109/ICPP.2013.27

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Generating Synthetic Graph Data from Random Network Models**

Ulrich Meyer and Manuel Penschuck(B)

Goethe University Frankfurt, Frankfurt, Germany {umeyer,mpenschuck}@ae.cs.uni-frankfurt.de

**Abstract.** Network models are developed and used in various fields of science as their design and analysis can improve the understanding of the numerous complex systems we can observe on an everyday basis. From an algorithmics point of view, structural insights into networks can guide the engineering of tailor-made graph algorithms required to face the *big data* challenge.

By design, network models describe graph classes and therefore can often provide meaningful synthetic instances whose applications include experimental case studies. While there exist public network libraries with numerous datasets, the available instances do not fully satisfy the needs of experimenters, especially pertaining to size and diversity. As several SPP 1736 projects engineered practical graph algorithms, multiple sampling algorithms for various graph models were designed and implemented to supplement experimental campaigns. In this chapter, we survey the results obtained for these so-called graph generators. This chapter is partially based on [43 SPP].

**Keywords:** Random graphs · Graph generator · Sampling · Parallel · Distributed · External memory

# **1 Motivation**

Networks are the very fabric that makes societies [5,40]. As such, humanity has been seeking to understand their structures, rules, and implications for centuries (see also Chapter 1). The practical importance of networks, however, only skyrocketed with the advent of the information age. Nowadays, modern computers offer sufficient storage and processing capacity to map out most aspects of human life and the world we inhabit. They are fed by billions of interconnected sensors and computerized personal devices that produce enormous volumes of network data to be exploited.

Computer science provides the means to face this *big data challenge*. However, a formal grammar capturing the inner structure of the data expected to be processed is required to provide tailor-made solutions. Network models are just that: a mathematical tool to describe and analyze realistic graphs. Research into and applications of these models are deeply intertwined with various fields of science.

Networks are commonly modeled by so-called *random graphs*; such models therefore represent probability distributions over the set of graphs [8]. These distributions are almost always parametrized (e.g., for the graph size or density) and typically follow implicitly from some randomized construction algorithm. Popular models are designed such that we<sup>1</sup> can expect certain topological properties from a randomly drawn instance: a particularly interesting goal is to reproduce the loosely defined class of *complex networks* which, among others, encompasses most social networks.

By expressing network models as random graphs, we inherit a rich set of tools from combinatorics, probability theory, and graph theory. In algorithmics we may, for instance, assume that meaningful inputs are random instances of a *suitable* network model. Then we can derive realistic formal performance predictions using average-case analysis, smoothed complexity, et cetera. In practice, such results tend to be more relevant than worst-case analysis based on pathological structures that are implausible in applications.

Network models also enable or supplement experimental campaigns as a versatile source of synthetic data with controllable independent variables. Synthetic benchmarks are especially useful in the context of large instances where real data is typically unavailable in sufficient size, quantity, or variety. Even if the data exists, procuring and archiving it may be difficult for legal or technical reasons; this threatens the independent reproducibility of results and thus infringes on one of science's cornerstones [45].

#### **1.1 Structure**

In Sects. 1.2 and 1.3, we introduce the definitions and notation used in this chapter. The main part of the chapter is then organized by the network model type. Section 2 discusses the notion of *random graphs* in detail and introduces sampling algorithms for the *G* (*n*, *p*) and *G* (*n*,*m*) models.

Sections 3 to 5 deal with random graph classes that focus on the distribution of degrees. Preferential attachment models, and especially the BA model by Barabási and Albert, explain the emergence of powerlaw degree distributions in growing networks; we discuss suitable sampling algorithms, so-called (graph) generators, in Sect. 3. The R-MAT model, also capable of producing powerlaw degree distributions, is presented in Sect. 4. In Sect. 5, we consider several solutions for the following problem: given a list of degrees, produce a uniform sample from the set of all simple graphs that realize these degrees. Section 6 discusses geometrically embedded random graphs, including the popular Random Hyperbolic Graphs.

Finally, in Sect. 7, we introduce network analysis and generator software supported by the SPP 1736.

#### **1.2 Notation**

A graph *G* = (*V*,*E*) models a set of objects (nodes) *V* = {*v*1,...,*vn*} and their connections *E* (edges). Throughout this chapter, we will denote the numbers of nodes and edges as *n* = |*V*| and *m* = |*E*|, respectively. A graph class is called *sparse* if *m* = *O*(*n* polylog *n*) and *dense* if *m* = Θ(*n*<sup>2</sup>).

Edges can encode a direction (i.e., *E* ⊆ *V* × *V*) or be undirected (i.e., *E* ⊆ {{*u*,*v*}|*u*,*v* ∈ *V*}). If not stated differently, we assume undirected graphs. An edge that

<sup>1</sup> In the interest of readability, "*we*" is used quite casually in this chapter. Please note that it also appears regularly in the context of work of others.

exists multiple times is called a *multi-edge* and part of a *multi-graph*. A graph without multi-edges or self-loops (edges connecting a node to itself) is called *simple*.

Two nodes *u* and *v* connected by an edge *e* are *neighbors* or said to be *adjacent*; the nodes *u* and *v*, in turn, are incident to the edge *e*. The number of neighbors of a given node *u* ∈ *V* is called its *degree* deg(*u*). A sequence (deg(*v*1),...,deg(*vn*)) is called a *degree sequence*. The related concept of a *degree distribution* refers to the probability distribution of the degree of a randomly sampled node (possibly in a randomly sampled graph). Many observed networks exhibit a *powerlaw* degree distribution where the probability of degree *k* is proportional to *k*<sup>−γ</sup> for some 2 < γ (and often γ < 3). A property applies *with high probability* (*w.h.p.*) if it holds with probability of at least 1 − *x*<sup>−α</sup> for α ≥ 1, where *x* depends on the context and is often the problem size *n* or *m*.

#### **1.3 Models of Computation**

The design of an algorithm is heavily influenced by the assumed model of computation. If not stated differently, we assume the *unit-cost RAM* in which operations for control flow, data access, and basic arithmetic are handled in constant time. For shared-memory parallel algorithms, its parallel variant PRAM is used. In a parallel context, we use the term *processing unit (PU)* to refer to an abstract machine executing a sequential algorithm (e.g., a core in a CPU or an individual processor in a distributed computer cluster). A problem is said to be *pleasingly parallel* if it consists of sufficiently many subproblems that can trivially be computed independently.

To model the cost of data transfer, the *external memory model* by Aggarwal and Vitter [1] assumes a two-level memory hierarchy. It consists of an internal memory of size *M* and an unbounded external disk which holds the algorithm's input and output. Computation is free, but is only possible on data in internal memory, so data has to be moved to and from the disk. Data access is block-oriented and transfers *B* data items per *I/O*. Reading and writing *N* contiguous items is referred to as *scanning* and requires scan(*N*) = Θ(*N*/*B*) I/Os. Sorting such items triggers sort(*N*) = Θ((*N*/*B*) log<sub>*M*/*B*</sub>(*N*/*B*)) I/Os and constitutes a lower bound for most intuitively hard problems.

Analogously, the cost of communication is often a bottleneck for distributed machines consisting of interconnected processors. *Communication-agnostic* algorithms are an extreme case of communication avoidance. Each PU is only aware of its rank, the total number of PUs, and some input configuration. However, exchange of any further information during the execution of the algorithm is prohibited.

# **2 Random Graphs and the** *G*(*n*, *p*) **and** *G*(*n*,*m*) **Models**

<sup>A</sup> *random graph* is a probability distribution *<sup>P</sup>*: <sup>G</sup> <sup>→</sup> [0,1] where <sup>G</sup> is the set of all graphs. Virtually all *random graph models*<sup>2</sup> are parameterized and thus form families of probability distributions. The underlying distributions are typically specified implicitly, and often have a finite support defined by some combinatorial constraints.

<sup>2</sup> In the literature the terms *random graph* and *random graph model* are commonly used interchangeably, and may even refer to a random instance sampled from a model. We adopt the former simplification for the sake of readability.

As an example, consider the popular *G* (*n*, *p*) model introduced by E. Gilbert [20] in 1959. In its original formulation, the model's support consists of all 2<sup>*n*(*n*−1)/2</sup> undirected graphs with exactly *n* nodes. The probability distribution is given indirectly via the following sampling algorithm:

"Pick one of these graphs by the following random process. For all pairs of points [nodes] make random choices, independent of each other, whether or not to join the points of the pair by a line [edge]. Let the common probability of joining be *p*." [20]

In other words, in a random instance of *G* (*n*, *p*) any edge *e* exists independently with probability *p*. Observe that *G*(*n*,1/2) hence yields the uniform distribution over all graphs with *n* nodes. It is therefore a so-called *maximum entropy model* and sometimes even referred to as *the* random graph [5].

Erdős and Rényi [17] propose the related and well-known *G* (*n*,*m*) model as the uniform distribution over all undirected graphs with *n* nodes and *m* edges. The models *G* (*n*, *p*) and *G* (*n*,*m*) with *m* = *p* · *n*(*n*−1)/2 are equivalent in the limit of *n* → ∞.

Neither *G* (*n*, *p*) nor *G* (*n*,*m*) explain the non-trivial structural properties of observed networks. Since all edges are chosen (mostly) independently with identical probabilities, we do not expect the formation of any complex features. Several ways to formalize this intuition are discussed in [44 SPP]. Still, the models are commonly used to generate synthetic data, e.g., as a null-model.

#### **2.1 Sampling from** *G* (*n*, *p*) **and** *G* (*n*,*m*)

Gilbert's sampling algorithm is designed to communicate the model's spirit to a human reader and, as such, is not optimized for performance. The generator thus requires Ω(*n*<sup>2</sup>) work independently of the linking probability *p*, which is suboptimal for non-dense graphs.

Batagelj and SPP 1736 PI Brandes [6] describe an optimal sequential generator requiring work linear in the number *m* of edges produced. The algorithm fixes a convenient order of all possible edges (i.e., a bijection π : [*n*(*n*−1)/2] → {{*u*,*v*} | *u*,*v* ∈ *V* ∧ *u* ≠ *v*}) and considers them in this sequence. Since each edge in a *G* (*n*, *p*) graph is the result of an independent Bernoulli trial, the number of "non-edges" between any two successful trials follows a geometric distribution. The generator therefore draws a random geometric variate, jumps over that many non-edges, writes out the next edge, and repeats until all possible edges have been considered.
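The following sketch illustrates this skipping technique in Python; it is a simplified rendition (assuming 0 < *p* < 1 and node indices 0,...,*n*−1), not the authors' implementation.

```python
import math
import random

def gnp_edges(n, p, seed=None):
    """Skip-based G(n,p) sampler: jump over geometrically distributed runs of
    non-edges instead of testing every node pair individually."""
    rng = random.Random(seed)
    edges = []
    v, w = 1, -1                      # current position in the lower triangle
    log_q = math.log(1.0 - p)         # assumes 0 < p < 1
    while v < n:
        # number of skipped non-edges follows a geometric distribution with parameter p
        skip = int(math.log(1.0 - rng.random()) / log_q)
        w += 1 + skip
        while w >= v and v < n:       # wrap into the following rows
            w -= v
            v += 1
        if v < n:
            edges.append((v, w))
    return edges
```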

Since all edges are drawn independently, the generator can be parallelized by partitioning the sequence of possible edges into independent sub-problems of roughly equal size. Later, Bringmann and Friedrich [9] give an exact variant of the algorithm that does not require real-valued arithmetic to sample the skip distances.

Sampling from *G* (*n*,*m*) is more challenging than *G* (*n*, *p*) since faithful *G* (*n*,*m*) generators cannot assume independent trials. This is due to the fact that partially sampled edges and non-edges affect the probability distribution of the remaining candidates. While Batagelj and Brandes remark that their *G* (*n*, *p*) generator can be extended to *G* (*n*,*m*) by modifying the skip distance distribution accordingly, they continue to develop a more efficient alternative requiring work linear in the number of edges produced [6].

In the following, we however focus on a parallel approach by Funke et al. [18 SPP] and showcase general divide-and-conquer techniques used to yield communication-agnostic generators. The resulting generator is a variant of a parallel sampling algorithm [47 SPP] for the related problem of randomly selecting *m* distinct elements from a finite universe (i.e., sampling without replacement).

For simplicity's sake, we only consider the directed variant of *G* (*n*,*m*).<sup>3</sup> In order to parallelize, we partition the set of nodes *V* into disjoint subsets *V*1,...,*Vp* of roughly equal size. Then, processing unit *i* is tasked to produce the *mi* out-going edges of nodes in *Vi*. By definition of *G* (*n*,*m*), we require that ∑<sub>*i*</sub> *mi* = *m*. Observe that this is the only dependency between subproblems. Thus, if *mi* is known a priori, PU *i* can work independently.

Consequently, we need to find a communication-agnostic way to agree on a consistent and randomly chosen **m** = (*m*1,...,*mP*) where each PU only needs to know its own value *mi*. The vector **m** follows a multivariate hypergeometric distribution where the number of "positive instances" for the *i*-th entry is given by the number *n* · |*Vi*| of potential edges processed by PU *i*. Under the assumption that the number *P* of PUs satisfies *P* = *O*(*n*/log *n*), the values of *mi* are sufficiently concentrated to bound the complexity of the previous local sampling to *O*((*n*+*m*)/*P*) w.h.p.

A traditional distributed generator may sample **m** on a central PU and then broadcast the values—this is, however, not possible in a communication-agnostic setting since it incurs a communication volume of Ω(*P*). Alternatively, each PU can independently sample **m** with pseudo-random number generators that use a common seed value. This approach requires expected time Θ(*P*) and, thus, dominates the total runtime for *P* = ω(√*m*).

Thus, we rather follow a divide-and-conquer approach which works for various distributions and is also used in Sect. 6.3. Roughly speaking, each *mi* corresponds to a leaf in a binary tree of depth *O*(log*P*). At each inner node, we draw a random variate *x* from an appropriately parametrized hypergeometric distribution and interpret *x* as the number of edges to be produced in the left subtree. Each PU follows its unique path from the root to the *i*-th leaf to sample its own value of *mi*. To achieve consistent values, the sampling at each inner node is carried out using a pseudo-random number generator whose seed is deterministically derived from a unique node index.

The authors show that combining these ideas yields a communication-agnostic generator with a runtime complexity of *O*((*n*+*m*)/*P* + log *P*) w.h.p.
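A minimal sketch of this divide-and-conquer splitting follows (hypothetical function and parameter names; the per-node seeding scheme is an assumption for illustration). Each PU recomputes only the *O*(log *P*) splits on its root-to-leaf path; seeding the generator from the tree node alone makes all PUs agree on the result.

```python
import numpy as np

def local_edge_count(my_rank, P, m, potential_edges, global_seed=12345):
    """Return m_i for this PU; potential_edges[i] = #potential edges of PU i."""
    lo, hi = 0, P                              # PUs handled by the current subtree
    edges, total = m, sum(potential_edges)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left = sum(potential_edges[lo:mid])    # potential edges in the left subtree
        rng = np.random.default_rng((global_seed, lo, hi))   # one seed per tree node
        x = rng.hypergeometric(left, total - left, edges)    # edges assigned to the left
        if my_rank < mid:
            hi, edges, total = mid, x, left
        else:
            lo, edges, total = mid, edges - x, total - left
    return edges
```

Every PU calls the same function with its own rank; no messages are exchanged, yet the resulting values sum to *m* by construction.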

# **3 Preferential Attachment**

Barabási and Albert [4] propose a simple stochastic process to explain the emergence of scale-free networks and show that two ingredients, namely growth and selection bias, suffice to yield networks with powerlaw degree distributions.<sup>4</sup>

<sup>3</sup> The undirected [18 SPP] variant only differs in the partitioning of the parallel subproblems.

<sup>4</sup> Earlier, Price [50] proposed a similar process inspired by Pólya urns [16]. The author applies it to citation networks with a known powerlaw in-degree distribution [46]. The more widespread BA model is sometimes interpreted as a special case of Price's model.

At its core, their BA model relies on *preferential attachment*, a positive feedback loop in dynamic systems where selecting an item at one point in time increases the probability of selecting it again in the future. It is proverbially summarized as "the rich get richer".

Based on this idea, the authors describe the following random graph. Starting with an arbitrary seed graph *G*<sub>0</sub> with *n*<sub>0</sub> vertices and *m*<sub>0</sub> edges, we iteratively add *n* − *n*<sub>0</sub> nodes—one node at a time. For each new node, we choose *d* neighbors at random where the probability to select node *v* is proportional to the degree of *v* at that time.

The main algorithmic challenge of BA lies in this dynamic weighted sampling. Depending on the assumed model of computation, quite different solutions are available. Batagelj and Brandes [6] observe that each node with degree *k* appears exactly *k* times in the edge list produced so far. Therefore, the underlying dynamic weighted sampling problem can be reduced to uniformly selecting entries from the edge list, leading to the linear-time generator BB-BA.
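A compact sketch of this reduction follows (a simplified rendition, not the original code; it starts from a trivial seed and, like the original formulation, may emit occasional self-loops and multi-edges):

```python
import random

def bb_ba(n, d, seed=None):
    """Batagelj-Brandes-style BA sampler: a uniform position in the edge list
    built so far hits node u with probability proportional to u's degree."""
    rng = random.Random(seed)
    M = [0] * (2 * n * d)                 # flat edge list
    for v in range(n):
        for i in range(d):
            pos = 2 * (v * d + i)
            M[pos] = v                    # the newly introduced node
            M[pos + 1] = M[rng.randrange(pos + 1)]   # degree-proportional neighbor
    return [(M[2 * t], M[2 * t + 1]) for t in range(n * d)]
```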

As BB-BA requires unstructured I/Os, it cannot efficiently produce graphs that do not fit into main memory. Meyer and Penschuck [36 SPP] introduce TFP-BA and MP-BA, the first two I/O-efficient sampling approaches for random graph models based on preferential attachment. The authors initially focus on BA graphs to demonstrate the techniques and subsequently discuss additional features such as seed graphs exceeding main memory, nodes with inhomogeneous initial degrees, the inclusion of uniform node sampling, directed graphs, and edges between two randomly chosen nodes.


MP-BA maintains a weighted tree *T* whose leaves correspond to the graph's nodes and carry their current degrees as weights; each inner node stores the aggregate weight of its subtree. In order to select a neighbor, MP-BA first has to sample a leaf according to the current degree distribution and then increment the leaf's weight to account for the newly gained edge. The key insight is that we can do both in a single top-down traversal from the root to the sampled leaf. This allows us to combine the queries for sampling and updating into a single operation and, in turn, to coalesce queries into batches. MP-BA requires *O*(sort(*n*<sub>0</sub> + *m*)) I/Os, where *n*<sub>0</sub> is the number of nodes in the seed graph and *m* is the number of edges produced.

The algorithm uses two forms of parallelism: firstly, *T* is cut at a certain depth to process the subtrees rooted there in a pleasingly parallel fashion. Secondly, to handle the high volume of requests near *T*'s root, a dedicated PRAM algorithm processes multiple requests to the same tree node in parallel. MP-BA's implementation executes the latter part on a GPU for maximal throughput.

Sanders and Schulz [48 SPP] describe CA-BA, a communication-agnostic generator for distributed-memory parallelism. Their algorithm builds on top of BB-BA and uses pseudo-randomization to avoid all lookups to edges generated. By doing so, several PUs can work on the problem without exchanging information other than an initial broadcast of the seed graph and a few parameters. In contrast to the original algorithm, CA-BA does not maintain an edge list to sample from explicitly. To simplify the description, we still presume its existence as a concept for addressing.

In order to add an edge, the generator needs to place the indices of the two incident nodes into the edge list. Recall that each generated edge consists of a newly introduced node and a randomly selected neighbor. By convention, we store the former at even positions of the edge list, and the latter at odd positions. Since by definition of the BA model, each newly introduced node is initially incident to exactly *d* edges, all entries at even positions follow from a simple index transformation.

Sampling random neighbors involves a shared random hash function *h*(·) with the property that *h*(*i*) < *i*. Then, in order to choose the node index of the random neighbor to be written to the edge list's *i*-th position, we conceptually copy the value from position *j* = *h*(*i*). To do so, we distinguish three cases:

– If *j* points into the seed graph, the node index is read directly from the (broadcast) seed edge list.
– If *j* is an even position outside the seed graph, the entry is the index of a newly introduced node and follows directly from the index transformation described above.
– Otherwise, *j* is an odd position whose entry is itself a copied value; we recurse with *i* ← *j*.


The first two cases imply constant work on a unit-cost RAM. Since we assume *h*(·) to be a random function, the first two cases are chosen with probability of at least 1/2. Thus, the recursion of the last case has an expected depth of at most 2 and is *O*(log*m*) with high probability. Assuming *h*(·) can be evaluated in constant time, CA-BA therefore requires expected linear work.
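A minimal sketch of this pseudo-random resolution (hypothetical helper names; it assumes a seed graph stored as a list of edges, *d* out-going edges per new node, and a shared hash function `h` with `h(i) < i`):

```python
def entry(i, d, n0, seed_edges, h):
    """Value at position i of the conceptual edge list, computed on demand."""
    len_seed = 2 * len(seed_edges)
    if i < len_seed:                      # case 1: position inside the seed graph
        return seed_edges[i // 2][i % 2]
    if i % 2 == 0:                        # case 2: even position -> newly added node
        return n0 + (i - len_seed) // (2 * d)
    return entry(h(i), d, n0, seed_edges, h)   # case 3: copy the value from position h(i)
```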

# **4 R-MAT**

R-MAT [15] graphs are a well-accepted network model which is especially known for its use in the Graph500 benchmark [38]. The model is defined for graphs on 2*<sup>k</sup>* nodes and *m* edges. To sample an edge, we recursively subdivide the adjacency matrix into four quadrants, assign them probabilities *pa* + *pb* + *pc* + *pd* = 1 provided as model parameters, and randomly select one. We repeat this *k* times until we reach a matrix of size 1×1 which corresponds to the edge. Depending on the model, we either allow multi-edges, or reject and resample to avoid duplicates. Undirected graphs are possible and typically imply additional symmetry constraints on the quadrant probabilities. For certain sets of parameters, the model exhibits similarities to observed networks, such as a powerlaw degree distribution [34].

Following the recursive definition, there exists a bijection between the possible edges and the set of words Σ<sup>*k*</sup> over Σ = {a, b, c, d}, where each *x* ∈ Σ represents the quadrant chosen. A naive R-MAT generator explicitly samples the *k* characters, one after another, and thus requires Ω(log *n*) work per edge.
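For illustration, a sketch of such a naive edge sampler (one recursion level per bit of the row and column index; the parameter names are ours):

```python
import random

def rmat_edge(k, pa, pb, pc, pd, rng=random):
    """Sample one R-MAT edge on 2**k nodes by choosing one quadrant per level."""
    row = col = 0
    for _ in range(k):
        r = rng.random()
        row, col = row << 1, col << 1
        if r < pa:                    # quadrant a: top-left
            pass
        elif r < pa + pb:             # quadrant b: top-right
            col |= 1
        elif r < pa + pb + pc:        # quadrant c: bottom-left
            row |= 1
        else:                         # quadrant d: bottom-right
            row |= 1
            col |= 1
    return row, col
```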

Hübschle-Schneider and Sanders [27 SPP] propose a communication-agnostic scheme that instead samples edges in constant time under the reasonable assumption that *m* = Ω(*n*). The algorithm performs a preprocessing step to construct an urn which contains *n*<sup>α</sup> path fragments (for some α < 1) weighted by their probabilities in time *O*(*n*), e.g., by considering all words over Σ of fixed length ℓ = log<sub>2</sub> √*n* = *k*/2.

To draw an edge we sample *k*/ℓ = *O*(1) fragments. We then concatenate them using bit-parallel shifting and masking operations available in virtually all modern computers. Both steps require only constant time per edge.

# **5 Simple Graphs from Prescribed Degree Sequence**

The sampling of random graphs matching a prescribed degree sequence is a common task in network analysis. Its various applications range from the construction of null-models (e.g., Chapter 3) to their use as building blocks in graph generators. Instances of the latter are the popular LFR benchmark [28] or the derived ReCon [51 SPP] model to generate scaled replicas of an input graph.

The computational cost of this approach heavily depends on the exact formulation of the model. Two models with linear work sampling algorithms are the Chung-Lu (CL) model and the Configuration Model (CM). The CL model produces the prescribed degree sequence only in expectation (see [44 SPP] for details). The CM, on the other hand, exactly matches the prescribed degree sequence but permits self-loops and multi-edges. These parallel edges affect the uniformity of the model [39, p. 436] and are inappropriate for certain applications; however, erasing them may lead to significant changes in topology [49 SPP]. In the following, we focus on simple graphs (i.e., without self-loops or multi-edges) matching a prescribed degree sequence exactly. Several generators and models for such graphs were considered within the SPP 1736.
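As a point of reference for the discussion below, a minimal sketch of the Configuration Model (our illustration; it assumes an even degree sum, and self-loops and multi-edges may occur, as noted above):

```python
import random

def configuration_model(degrees, rng=random):
    """Pair up degree 'stubs' uniformly at random; degrees are met exactly,
    but the result is generally a non-simple multi-graph."""
    stubs = [v for v, deg in enumerate(degrees) for _ in range(deg)]
    rng.shuffle(stubs)
    return list(zip(stubs[::2], stubs[1::2]))
```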

#### **5.1 The Edge Switching Markov Chain Model**

The Fixed-Degree-Sequence-Model (FDSM) is a common solution to obtain simple graphs from a prescribed degree sequence. It first constructs a biased deterministic graph (e.g., using the HAVEL-HAKIMI algorithm [23,26]) and then uses an Edge Switching (ES) Markov chain process [21] to perturb the graph. In each step, the process selects two edges uniformly at random and exchanges their incident nodes—by doing so the degrees of all nodes involved do not change. If a step were to result in a self-loop or multi-edge, it is rejected without replacement. Despite intensive research, it remains an open problem to find *practical* upper bounds on the Markov chain's mixing time, i.e., the number of steps required to obtain a uniform sample. In practice, a small multiple of the number of edges typically suffices (cf. Chapter 3).
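A single ES step can be sketched as follows (our simplified illustration, not one of the implementations discussed below; one of the two possible rewirings is shown, and the graph is kept as an edge list plus a set of frozensets for constant-time existence checks):

```python
import random

def edge_switch_step(edges, edge_set, rng=random):
    """Attempt one edge switch; returns True iff the switch was applied."""
    i, j = rng.randrange(len(edges)), rng.randrange(len(edges))
    (a, b), (c, d) = edges[i], edges[j]
    if len({a, b, c, d}) < 4:                      # same edge or would create a self-loop
        return False
    if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
        return False                               # would create a multi-edge
    edge_set.discard(frozenset((a, b))); edge_set.discard(frozenset((c, d)))
    edge_set.add(frozenset((a, d)));     edge_set.add(frozenset((c, b)))
    edges[i], edges[j] = (a, d), (c, b)
    return True
```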

The main issue when implementing ES is the large number of unstructured accesses to memory; for each switch it is necessary to identify the involved nodes, check whether the updated edges already exist, and finally to write out the updates.

Hamann et al. [25 SPP] describe EM-LFR, an I/O-efficient pipeline to sample large instances from the LFR model. From an algorithmic point of view, two central parts of the pipeline are EM-HH and EM-ES which together implement FDSM. EM-HH is designed to avoid memory accesses as much as possible, especially for graphs with powerlaw degree distributions. EM-ES, on the other hand, batches Θ(*m*) individual swaps and processes them out-of-order without changing the outcome; due to the large number of swaps in each batch, we can amortize the I/O volume and stream through the whole graph a constant number of times rather than executing Θ(*m*) more expensive unstructured accesses.

Later, [24 SPP] proposes a modification of the FDSM and provides empirical evidence of faster mixing. The previous combination of EM-HH followed by EM-ES starts with a highly biased simple graph. The novel EM-CM/ES takes another route: it starts with a random but non-simple graph and switches edges until a simple random graph is obtained. It uses an I/O-efficient generator for the Configuration Model and a variant of EM-ES which accepts non-simple inputs without increasing its I/O complexity. The modified algorithm executes all switches that neither increase the multiplicity of a given edge nor introduce self-loops. Non-simple edges are also switched more frequently than legal edges to accelerate the repair phase. Observe, however, that it does not suffice to rewire non-simple edges using the presented variant of ES as this produces a biased sample [2,3]. Instead, additional ES steps are necessary.

Brugger et al. [12 SPP] implement ES in hardware and investigate two cases, which are detailed in Chapter 4. Their design maintains the graph in a hybrid data structure combining an adjacency list to efficiently sample edges and an adjacency matrix for fast edge existence queries.


#### **5.2 Curveball**

Curveball (CB) [52] is a more recent process but structurally similar to ES; instead of selecting random edges, CB selects two random nodes *u* ≠ *v* and *trades* their neighborhoods as follows. CB begins by freezing all edges that either connect *u* and *v* themselves or link to neighbors which *u* and *v* have in common. Then, the remaining neighbors are randomly shuffled while maintaining the degrees of *u* and *v*. A single CB trade can therefore inflict "more change" on a graph than a single edge switch; depending on the processed graph, a state in CB's Markov chain may have up to 2<sup>Θ(*n*)</sup> neighbors while the degrees in ES's chain are bounded by *O*(*n*<sup>4</sup>) [13]. Empirical data suggests that fewer trades are necessary to mix a graph (though each trade may require more work).
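A single trade can be sketched as follows on an adjacency-set representation (our simplified in-memory illustration; the I/O-efficient variants discussed next organize the same operation quite differently):

```python
import random

def curveball_trade(adj, u, v, rng=random):
    """Trade the neighborhoods of u and v; degrees are preserved."""
    frozen = (adj[u] & adj[v]) | {u, v}            # common neighbors and u, v themselves
    tradeable = list((adj[u] | adj[v]) - frozen)   # these get reshuffled between u and v
    keep_u = len(adj[u] - frozen)                  # u keeps this many of them
    rng.shuffle(tradeable)
    new_u, new_v = set(tradeable[:keep_u]), set(tradeable[keep_u:])
    for w in adj[u] - frozen: adj[w].discard(u)    # update the symmetric entries
    for w in adj[v] - frozen: adj[w].discard(v)
    adj[u] = (adj[u] & frozen) | new_u
    adj[v] = (adj[v] & frozen) | new_v
    for w in new_u: adj[w].add(u)
    for w in new_v: adj[w].add(v)
```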

CB exposes more data locality than ES since all information required to carry out a trade is contained in the two neighborhoods. This is in contrast to ES, which requires additional unstructured reads to prevent a switch from introducing multi-edges. Note, however, that an undirected edge is classically stored twice—once for each endpoint. In this scenario, frequent unstructured updates are necessary and negate the previously mentioned locality benefits.

The I/O-efficient EM-CB algorithm [14 SPP] thus relies on a dynamic data structure and assigns each edge only to the endpoint that is traded next. EM-CB uses the external-memory technique Time Forward Processing (TFP, see [35]) to ensure that the complete neighborhood of a node is available when needed.

The algorithm works in batches. At the beginning of each batch, it samples the node pairs to be traded within the batch and organizes them in dedicated indices. These auxiliary data structures are used to address the TFP messages and to determine which endpoint of an edge will be traded first. EM-CB requires *O*(*r*[sort(*n*) +sort(*m*)]) I/Os to carry out *r* global trades (see below).

Carstens et al. [14 SPP] generalize Global Curveball (G-CB) to undirected graphs. An *undirected global trade* is a sequence of *n*/2 single trades such that the neighborhood of each node is traded at most once. They show that the process converges to the uniform distribution over the set of all simple graphs matching the degree sequence and give empirical evidence of its superior performance compared to CB.

Since each node participates once<sup>5</sup> in a global trade, we can interpret a global trade as a random permutation of the nodes in which we trade the neighborhoods of nodes that are adjacent in the permutation. The authors then propose an algorithm that eliminates the auxiliary data structures by maintaining the permutation implicitly using a collision-free (on the relevant domain) and invertible hash function, and finally give a parallel version of it.

# **6 Geometrically Embedded Random Graphs**

Random Hyperbolic Graphs (RHGs) are a popular network model which naturally exhibits many features commonly observed in complex networks. RHG assigns each node a position on a two-dimensional hyperbolic disk of radius *R*. These positions are conveniently expressed in polar coordinates where each point is located in terms of its distance *r* (radius) to the disk's center and an angular coordinate θ.

In the so-called Threshold RHG [22], we connect all pairs of points (*ri*, θ*i*) and (*rj*, θ*j*) with *i* ≠ *j* whose hyperbolic distance *d*(*pi*, *pj*) is smaller than *R*, where

$$\cosh(d(p\_i, p\_j)) = \cosh(r\_i)\cosh(r\_j) - \sinh(r\_i)\sinh(r\_j)\cos(\theta\_i - \theta\_j). \tag{1}$$

Thus, the hyperbolic distance is a function of the relative and absolute positions of both points; the closer a point is to the disk's center, the more neighbors it is expected to have. We obtain a powerlaw degree distribution with a controllable exponent by choosing an appropriate radial density for the randomly placed points.

<sup>5</sup> For simplicity, we assume here that *n* is even.

Binomial RHG extends Threshold RHG by adding a positive *temperature* parameter *T* that affects the local cohesion. In the binomial variant, each pair of nodes *pi* ≠ *pj* is independently connected by an edge with probability *pT*(*d*(*pi*, *pj*)) defined as follows:

$$p\_T(d) = \left[ \exp\left(\frac{d-R}{2T}\right) + 1\right]^{-1} \tag{2}$$

Binomial RHG contains Threshold RHG as *pT* becomes a step function for *T* → 0. Looz and Meyerhenke [31 SPP] propose an extension of the RHG model to generate dynamic graph data sets: their model adds movement of nodes which in turn translates to a stream of edge insertions and deletions.
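The following sketch (ours, for illustration) computes the distance of Eq. (1) and the two connection rules discussed above, i.e., the threshold rule and the probability of Eq. (2):

```python
import math

def hyperbolic_distance(r1, t1, r2, t2):
    """Hyperbolic distance between two points in polar coordinates, Eq. (1)."""
    c = math.cosh(r1) * math.cosh(r2) - math.sinh(r1) * math.sinh(r2) * math.cos(t1 - t2)
    return math.acosh(max(c, 1.0))          # clamp guards against rounding below 1

def threshold_edge(p1, p2, R):
    """Threshold RHG: connect iff the distance is below R."""
    return hyperbolic_distance(*p1, *p2) < R

def binomial_edge_probability(p1, p2, R, T):
    """Binomial RHG: connection probability p_T(d), Eq. (2)."""
    d = hyperbolic_distance(*p1, *p2)
    return 1.0 / (math.exp((d - R) / (2.0 * T)) + 1.0)
```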

#### **6.1 Efficient Generators Based on Geometric Data Structures**

A naive RHG generator that checks each node pair for an edge requires Ω(*n*<sup>2</sup>) work and offers little parallel depth<sup>6</sup>. All efficient generators we are aware of reduce the computational complexity in a two-step process: they cheaply identify a set of edge candidates (i.e., a super-set of the true result), and then filter the candidates more carefully. The identification typically exploits geometric or stochastic arguments, while the filtering process tends to involve costly per-edge distance computations.

All geometric generators discussed in the remainder of this chapter use one of two geometric partitioning schemes, namely a quad-tree or a band structure.

– Looz et al. [32 SPP] describe NKQUAD, the first sub-quadratic work RHG generator. NKQUAD is based on a polar quad-tree which recursively subdivides the space into four quadrants each (i.e., each inner tree-node introduces two cuts, one in the angular and one in the radial dimension, respectively). The generator then iterates over all nodes and computes for each *v* ∈ *V* the neighbor candidates *Cv*. The set *Cv* consists of all nodes in quad-tree leaf cells which intersect the hyperbolic circle of radius *R* around *v*. The identification of such leaves is simplified by working in the Poincaré projection which translates hyperbolic circles into (radially shifted) Euclidean circles. The authors show that such a query examines *O*(√*n* + |*Cv*|) leaves w.h.p., leading to total work of *O*((*n*<sup>3/2</sup> + *m*) log *n*) w.h.p.

Later, Looz and Meyerhenke [30 SPP] generalize the data structure and extend the generator to Binomial RHG while maintaining the asymptotic complexity. The efficient sampling of low-probability edges is implemented by bounding the probability of connecting to any node within a leaf from above. These bounds are used to carry out geometric jumps (cf. Sect. 2) followed by rejection sampling to account for the over-estimation.

– Looz et al. [33 SPP] improve NKQUAD by proposing NKBAND featuring a novel partitioning scheme. NKBAND covers the hyperbolic disk with Θ(log *n*) disjoint concentric bands where each band is maintained as an array of points sorted by their angles. To find the neighbor candidates of a node *v* in band *bi*, the algorithm considers *bi* and all bands containing larger radii. For each such band *bj*, the smallest and largest angular coordinate of a potential neighbor of *v* in *bj* is computed; then

<sup>6</sup> Dependencies may arise from the output format, e.g., from a need for compaction.

two binary searches yield the left- and right-most candidates in the sorted array (see the sketch below). By doing so, the authors effectively over-estimate the upper half of the hyperbolic circle around node *v* by a discrete stack of shrinking band-segments. The generator has an empirical runtime of *O*(*n* log *n* + *m*). Later, Looz [29 SPP] extends NKBAND to Binomial RHG using ideas similar to the generalization described for NKQUAD.
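A minimal sketch of the candidate search within one band (the parameter names are ours; the maximal angular deviation would be derived from Eq. (1), and wrap-around at 2π is ignored here):

```python
import bisect

def band_candidates(sorted_angles, theta_v, delta):
    """Indices of points in this band whose angle lies within delta of theta_v."""
    lo = bisect.bisect_left(sorted_angles, theta_v - delta)
    hi = bisect.bisect_right(sorted_angles, theta_v + delta)
    return range(lo, hi)
```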

#### **6.2 A Fast and Memory-Efficient Streaming Generator for RHG**

The geometric data structures used by NKQUAD, NKBAND, and HYPERGIRGS have a large memory footprint that can render them unsuitable for accelerator hardware with a small dedicated memory. Therefore, [42 SPP] presents HYPERGEN, a streaming generator for Threshold RHGs which instead samples the points on demand. The generator requires *O*([*n*<sup>1−α</sup> *d̄*<sup>α</sup> + log *n*] log *n*) words of memory w.h.p. For realistic average degrees *d̄* = *o*(*n*/log<sup>1/α</sup>(*n*)) this is a significant asymptotic reduction over classical approaches.

HYPERGEN executes a sweep-line algorithm and stores the set of nodes that may still find neighbors in its sweep-line state; we refer to them as *candidates*. Roughly speaking, the algorithm randomly samples points with non-decreasing angular coordinates.<sup>7</sup> For each new point, the algorithm identifies all sufficiently close candidates and emits edges to them. The generator then marks the point itself as a candidate and advances the sweep-line. HYPERGEN stops the sweep-line at additional points, e.g., to prune candidates whose distances to the sweep-line are so large that they cannot find neighbors anymore.

To manage the computational cost of maintaining the sweep state, HYPERGEN includes conservative approximations that do not infringe on the generator's faithful reproduction of RHGs. They exploit the distribution of points as well as properties of the hyperbolic distance function. The majority of points can be quickly pruned from the algorithm's state. In contrast, the few points that have small radii stay candidates for a significantly longer period of time. To accommodate the different requirements, HYPERGEN partitions the hyperbolic disk into Θ(log *n*) concentric bands. Each band has its own sweep-line and state which remain synchronized with the states of its adjacent bands.

Observe that, due to the angular periodicity of the hyperbolic disk, points sampled late (i.e., with angles near 2π) can be adjacent to points discovered and pruned much earlier. HYPERGEN accounts for this by restarting the sampling process until all candidates of the first phase are processed. It exploits pseudorandomness to obtain consistent point coordinates in both phases.

Parallelization is possible by splitting the disk into segments of equal size. Some care has to be taken to manage the dependencies near the segments' borders. HYPERGEN also significantly accelerates the frequent distance computations by preparing auxiliary values per point. This removes all transcendental functions (here sinh, cosh, and cos) from Eq. (1). Refined versions of these techniques carry over to Sects. 6.3 and 6.4.
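One way such auxiliary values can be chosen is sketched below (our illustration under this assumption, not the actual HYPERGEN code): storing cosh(*r*), sinh(*r*), cos(θ), and sin(θ) per point turns the threshold test into pure multiplications and additions.

```python
import math

def preprocess(points):
    """Per-point auxiliary values so that edge tests need no transcendental calls."""
    return [(math.cosh(r), math.sinh(r), math.cos(t), math.sin(t)) for r, t in points]

def is_edge(a, b, cosh_R):
    ca, sa, cta, sta = a
    cb, sb, ctb, stb = b
    cos_dt = cta * ctb + sta * stb               # cos(theta_a - theta_b)
    return ca * cb - sa * sb * cos_dt < cosh_R   # cosh(d) < cosh(R), cf. Eq. (1)
```

Because the comparison is branch-free arithmetic, several of these tests can be evaluated simultaneously with SIMD instructions, as described next.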

The implementation of HYPERGEN is designed with SIMD (Single-Instruction-Multiple-Data) in mind and is explicitly vectorized. It uses SIMD instructions to compute eight hyperbolic distances simultaneously (which is only possible because we first removed the aforementioned transcendental functions).

<sup>7</sup> This is an over-simplification of the sweep-line's behavior (cf. [42 SPP]).

#### **6.3 Communication-Agnostic Generators for RHG**

Funke et al. [19 SPP] present RHG, a communication-agnostic generator for Threshold RHG. The generators RHG and HYPERGEN were developed independently at roughly the same time, and share ideas to sample specific subsections of the hyperbolic disk using pseudorandomization. While HYPERGEN uses a monotonous sweep-like motion optimized for memory usage, RHG uses less structured queries. These "random" queries are answered using a fine-grained partitioning of the hyperbolic space which ingeniously allows random access to any cell (the geometry is similar to the one discussed in Sect. 6.1).

For huge graph instances, the number of nodes may be too large to sample—let alone store—all nodes on every distributed machine. Fortunately, a key property of relevant RHG graphs is that most nodes only have a very local neighborhood, i.e., a hyperbolic circle around each node suffices to compute all its links. Observe that many of these subsets overlap due to common edges. In general, there is no balanced mapping of nodes to processing units without overlaps. Thus, any two PUs with overlapping subsets have to have a consistent view of the underlying region of hyperbolic space.

We achieve this by partitioning the hyperbolic space into *k* cells. Then, the following process reproducibly samples points within a cell: for each cell *i*, we seed a pseudorandom number generator with a value deterministically derived from the cell's index (e.g., via a hash function *f*(*i*)) and, subsequently, use the generator to sample the *ni* points contained within the cell. By construction, this process yields consistent results even if executed by multiple independent processing units.
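A minimal sketch of such reproducible per-cell sampling (hypothetical interface; we assume the standard RHG radial density proportional to sinh(α*r*) and sample it by inverting its cumulative distribution, restricted to the cell's bounds):

```python
import math
import numpy as np

def sample_cell(cell_index, n_i, theta_lo, theta_hi, r_lo, r_hi, alpha, global_seed=42):
    """Reproducibly sample the n_i points of one cell; any PU obtains the same points."""
    rng = np.random.default_rng((global_seed, cell_index))   # seed depends on the cell only
    thetas = rng.uniform(theta_lo, theta_hi, n_i)
    F = lambda r: math.cosh(alpha * r) - 1.0                  # unnormalized radial CDF
    u = rng.uniform(F(r_lo), F(r_hi), n_i)
    radii = np.arccosh(u + 1.0) / alpha                       # inverse-CDF sampling
    return radii, thetas
```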

The only information missing is the number *ni* of points in cell *i*. The vector **N** = (*n*1,...,*nk*) follows a multinomial distribution due to the side condition that exactly *n* points need to be scattered in total, i.e., ∑<sub>*i*</sub> *ni* = *n*. All PUs obtain consistent values for **N** using common seeds for their pseudorandom generators, analogously to the divide-and-conquer approach in Sect. 2.1.

In [18 SPP], this technique is combined with HYPERGEN (see Sect. 6.2), yielding the communication-agnostic sweep-line generator SRHG which consistently outperforms RHG. We demonstrate its scalability to up to 32 768 cores and produce a graph with *n* = 2<sup>39</sup> nodes in less than a minute.

#### **6.4 GIRG-Based Generator**

Bringmann et al. propose Geometric Inhomogeneous Random Graphs (GIRGs) as a flexible and simple model that asymptotically contains RHG [11]. Roughly speaking, the model embeds a graph into a *d*-dimensional torus and uses node weights to control the degree sequence similarly to the Chung-Lu model. The authors also give an expected linear time sampling algorithm for GIRGs [10] which we engineer and adapt<sup>8</sup> to Binomial RHGs

<sup>8</sup> Bringmann et al. already discuss the applicability to RHG. The models are however not identical [7 SPP], and HYPERGIRGS closes this gap.

in [7 SPP]. We refer to our algorithms as GIRGS and HYPERGIRGS, respectively. To the best of our knowledge, GIRGS is the first practically efficient generator for the GIRG model. Here, we focus on RHGs since the algorithmic treatment of both models is very similar.

HYPERGIRGS first samples all points and builds a data structure that can be interpreted as a polar quad-tree. While the structure is similar to the previous state-of-the-art generator NKQUAD (see Sect. 6.1), differences in details result in a polynomial gap in their running times. In the following, we refer to nodes of the quad-tree as *tree-nodes* (to distinguish them from the hyperbolic nodes contained).

Bringmann et al. propose the following neighborhood search which is adapted by HYPERGIRGS. For simplicity, we initially restrict ourselves to Threshold RHGs. The generator enumerates all pairs of tree-nodes that may contain point pairs sufficiently close to imply an edge. This is done in a pessimistic and oblivious fashion, i.e., without considering the actual points represented by the tree-nodes. HYPERGIRGS then emits edges by testing all point pairs contained in each previously enumerated pair of tree-nodes. To avoid asymptotically significant overheads, the algorithm pairs tree-nodes as high up in the quad-tree as possible without adding unintended distance computations.

The quad-tree needs to support efficient random access to all points contained within any tree-node at any depth. Similarly to [10], HYPERGIRGS achieves this using z-order space-filling curves [41] to map the tree to memory. This choice allows us to efficiently build and query the quad-tree using Morton codes [37].
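For illustration, a sketch of the bit-interleaving behind such Morton codes in two dimensions (our example, not the chapter's implementation):

```python
def morton2(x, y, depth=16):
    """Interleave the low 'depth' bits of x and y into one z-order index."""
    code = 0
    for b in range(depth):
        code |= ((x >> b) & 1) << (2 * b)
        code |= ((y >> b) & 1) << (2 * b + 1)
    return code
```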

In case of Binomial RHGs with *T* > 0, any node pair has a positive (yet mostly negligible) probability *pT*(*d*) to be connected. HYPERGIRGS therefore has to consider all tree-node pairs—even those with a tiny connection probability. In the latter case, the connection probability is bounded from above. Then, we use geometric jumps followed by rejection sampling to prune the search space. The authors also engineer an exact look-up table-based sampling scheme to reduce the evaluation of transcendental functions during the computation of linking probabilities *pT*(*d*).

HYPERGIRGS processes the tree-node pairs in a pleasingly parallel fashion. As a special feature, its implementation guarantees reproducibility in the sense that two runs with the same set of parameters and seed values output the same set of edges (though not necessarily in the same order). At the time of writing, the implementation of HYPERGIRGS is the fastest sequential RHG generator and competitive for shared-memory parallelism.

# **7 Software Packages**

From a practical point of view, it is crucial that a generator interacts well with the software used to analyze the emitted graphs. A common choice is to write the produced graph into a file which can then be processed by a tool of choice. There are, however, notable drawbacks to this approach; for one, there is a plethora of file formats which may be incompatible. Also, reading and writing files can incur surprisingly high overheads (e.g., [42 SPP]).

The network analysis framework NetworKit (partially supported by the SPP 1736) includes generators for all network models that are discussed in depth in this chapter. As detailed in Chapter 1, this software package combines various types of graph algorithms efficiently implemented in C++ with an easy-to-use Python interface. The tight interaction between network generation and analysis promises fast and convenient processing pipelines.

KaGen is a graph generator suite for distributed computing and contains a number of communication-agnostic generators [18 SPP]. The suite includes generators for the following models, accessible via a common interface: *G* (*n*, *p*), *G* (*n*,*m*), Kronecker Graph, Random Geometric Graph, Random Delaunay Triangulation, Barabási-Albert, and Threshold RHG.

**Acknowledgements.** The authors thank Mario Holldack and Hung Tran for valuable discussions and their insightful comments.

# **References**

	- 8. Bollobás, B.: Random Graphs, 2nd edn. Cambridge Studies in Advanced Mathematics, vol. 73. Cambridge University Press, Cambridge (2011). https://doi.org/10.1017/CBO9780511814068
	- 9. Bringmann, K., Friedrich, T.: Exact and efficient generation of geometric random variates and random graphs. In: Fomin, F.V., Freivalds, R., Kwiatkowska, M., Peleg, D. (eds.) ICALP 2013. LNCS, vol. 7965, pp. 267–278. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-39206-1\_23
	- 10. Bringmann, K., Keusch, R., Lengler, J.: Sampling geometric inhomogeneous random graphs in linear time. In: ESA, pp. 20:1–20:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017). https://doi.org/10.4230/LIPIcs.ESA.2017.20
	- 11. Bringmann, K., Keusch, R., Lengler, J.: Geometric inhomogeneous random graphs. Theor. Comput. Sci. **760**, 35–54 (2019). https://doi.org/10.1016/j.tcs.2018.08.014
	- 13. Carstens, C.J., Berger, A., Strona, G.: Curveball: a new generation of sampling algorithms for graphs with fixed degree sequence. CoRR abs/1609.05137 (2016)
	- 15. Chakrabarti, D., Zhan, Y., Faloutsos, C.: R-MAT: a recursive model for graph mining. In: SDM, pp. 442–446. SIAM (2004). https://doi.org/10.1137/1.9781611972740.43
	- 16. Eggenberger, F., Pólya, G.: Über die Statistik verketteter Vorgänge. ZAMM-J. Appl. Math. Mech./Zeitschrift für Angewandte Mathematik und Mechanik **3**(4), 279–289 (1923)
	- 17. Erdős, P., Rényi, A.: On random graphs I. Publicationes Mathematicae Debrecen (1959)
	- 20. Gilbert, E.N.: Random graphs. Ann. Math. Stat. **30**(4), 1141–1144 (1959). https://doi. org/10.1214/aoms/1177706098
	- 21. Gkantsidis, C., Mihail, M., Zegura, E.W.: The Markov Chain simulation method for generating connected power law random graphs. In: Workshop on Algorithm Engineering and Experiments, pp. 16–25. Society for Industrial and App. Math. SIAM (2003)
	- 22. Gugelmann, L., Panagiotou, K., Peter, U.: Random hyperbolic graphs: degree sequence and clustering - (extended abstract). In: Czumaj, A., Mehlhorn, K., Pitts, A., Wattenhofer, R. (eds.) ICALP 2012. LNCS, vol. 7392, pp. 573–585. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31585-5\_51
	- 23. Hakimi, S.L.: On realizability of a set of integers as degrees of the vertices of a linear graph. I. J. Soc. Ind. App. Math. **10**(3), 496–506 (1962). https://doi.org/10.1137/0110037
	- 26. Havel, V.: Poznámka o existenci konečných grafů. Časopis pro pěstování matematiky **80**(4), 477–480 (1955)
	- 28. Lancichinetti, A., Fortunato, S.: Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Phys. Rev. E **80**(1), 016118 (2009). https://doi.org/10.1103/physreve.80.016118
	- 34. Mahdian, M., Xu, Y.: Stochastic Kronecker graphs. In: Bonato, A., Chung, F.R.K. (eds.) WAW 2007. LNCS, vol. 4863, pp. 179–186. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-77004-6\_14
	- 35. Maheshwari, A., Zeh, N.: A survey of techniques for designing I/O-efficient algorithms. In: Meyer, U., Sanders, P., Sibeyn, J. (eds.) Algorithms for Memory Hierarchies. LNCS, vol. 2625, pp. 36–61. Springer, Heidelberg (2003). https://doi.org/10.1007/3-540-36574-5\_3
	- 37. Morton, G.M.: A computer oriented geodetic data base and a new technique in file sequencing. Technical report. Int. Business Machines Company, New York (1966). https://domino.research.ibm.com/library/cyberdig.nsf/0/0dabf9473b9c86d48525779800566a39?OpenDocument
	- 38. Murphy, R.C., Wheeler, K.B., Barrett, B.W., Ang, J.A.: Introducing the graph 500. Cray Users Group (CUG) **19**, 45–74 (2010)
	- 39. Newman, M.E.J.: Networks: An Introduction. Oxford University Press, Oxford (2010). https://doi.org/10.1093/ACPROF:OSO/9780199206650.001.0001
	- 40. Newman, M.E.J., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E **69**(026113), 1–16 (2004). http://link.aps.org/abstract/PRE/v69/e026113
	- 41. Orenstein, J.A., Merrett, T.H.: A class of data structures for associative searching. In: PODS, pp. 181–190. ACM (1984). https://doi.org/10.1145/588011.588037
	- 45. Popper, K.: The Logic of Scientific Discovery. Hutchinson, London (1959)
	- 46. Price, D.J.D.S.: Networks of scientific papers. Science **149**(3683), 510–515 (1965). http://www.jstor.org/stable/1716232
	- 50. de Solla Price, D.J.: A general theory of bibliometric and other cumulative advantage processes. J. Am. Soc. Inf. Sci. **27**(5), 292–306 (1976). https://doi.org/10.1002/asi.4630270505
	- 52. Strona, G., Nappo, D., Boccacci, F., Fattorini, S., San-Miguel-Ayanz, J.: A fast and unbiased procedure to randomize ecological binary matrices with fixed row and column totals. Nat. Commun. **5**(1), 1–9 (2014). https://doi.org/10.1038/ncomms5114

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Increasing the Sampling Efficiency for the Link Assessment Problem**

André Chinazzo(B) , Christian De Schryver, Katharina Zweig, and Norbert Wehn

TU Kaiserslautern, Kaiserslautern, Germany {chinazzo,schryver,wehn}@eit.uni-kl.de, zweig@cs.uni-kl.de

**Abstract.** Complex graphs are at the heart of today's big data challenges like recommendation systems, customer behavior modeling, or incident detection systems. One recurring task in these fields is the extraction of network motifs, i.e., subgraphs that are recurring and statistically significant. To assess the statistical significance of their occurrence, the observed values in the real network need to be compared to their expected value in a random graph model.

In this chapter, we focus on the so-called Link Assessment (LA) problem, in particular for bipartite networks. Lacking closed-form solutions, we require stochastic Monte Carlo approaches that raise the challenge of finding appropriate metrics for quantifying the quality of results (QoR) together with suitable heuristics that stop the computation process if no further increase in quality is expected. We provide investigation results for three quality metrics and show that observing the right metrics reveals so-called *phase transitions* that can be used as a reliable basis for such heuristics. Finally, we propose a heuristic that has been evaluated with real-world datasets, providing a speedup of 15.4× over previous approaches.

**Keywords:** Link Assessment · Edge switching · Curveball · Random graphs

# **1 Introduction**

The *data deluge* phenomenon is ever more present. We, as a society, generate and store far more data than what we can make use of right now [7]. Among the reasons for that are: 1. new approaches to acquire data, ranging from internet traffic recordings to high-throughput DNA sequencing; and 2. the reduction in price per bit of data storage technologies, which motivates companies and researchers to be less selective about what data to store. These are by no means disadvantages over past methodologies; instead, they open new possibilities for data analysis that require methods that are more efficient and more robust against noise.

Complex network analysis is a tool-set of methods commonly used to extract information from large amounts of data, as long as the data can be meaningfully represented as a network. One popular method is the so-called *Link Assessment (LA)*, whose goal is to refine the data based on the principle of *structural similarity* (or *homophily*), i.e., entities that are alike tend to share a large proportion of their neighbors. Although the assumption of homophily is most common in social network analysis, mainly for unipartite networks, it has been shown to be useful in a large range of contexts, including bipartite networks (such as a user rating/movie network) [24 SPP].

For unipartite networks, such as a protein-protein interaction database [12] or social networks [24 SPP], the LA may serve as a data-cleansing method, evaluating whether each existing link is likely to be a true positive and whether each non-existing link is likely to be a true negative. This should not be confused with the link *prediction* problem, which is already well-researched [17] but poses a slightly different question: given a snapshot of a network at time *t*, which of the yet unconnected node pairs are predicted to be connected at *t* + 1?

For bipartite networks, such as genes that are associated with diseases [11] or products that are bought by customers [10], the LA is a systematic way of projecting such networks to one of their sides [27]. This so-called *one-mode projection* transforms a bipartite network into a unipartite one by connecting the nodes on one of the sides based on their connections to the other side, while the nodes of the other side are discarded (see Sect. 2). Since most methods and tools for network analysis focus on general graphs, the one-mode projection of bipartite networks is a particularly useful pre-processing step to their analysis [27]. In this chapter, therefore, we focus on the LA for bipartite networks.

The LA is closely related to a vast body of research that includes link prediction [17,20], recommendation systems [2], and node similarity in complex network analysis [16,18]. Known approaches for such problems can be divided into *supervised* and *unsupervised learning*. Supervised learning approaches require a ground truth, i.e., a subset of the network whose links or labels are known to be correct. In general, these ground truths, or training sets, are manually annotated and therefore are often the bottleneck in the data-mining pipeline [26]. Moreover, the ground truths are often split into a training set and a test set, where the test set is used to estimate the quality of results (QoR). Unsupervised learning methods, on the other hand, require no ground truth; instead, they rely only on the structure (or other properties) of the data itself. Thus, in many cases, the QoR cannot be directly estimated. Consequently, these methods should only be applied to types of data sets for which their robustness has already been validated.

As stated earlier, the LA is an unsupervised method that assumes the relationships in the network to adhere to the notion of homophily, where alike nodes tend to have larger common neighborhoods than one would expect from merely their degrees in a randomly constructed graph. In practice, the LA is based on Markov chain Monte Carlo (MCMC) methods that generate a large set of such random graphs. These MCMC methods are known to eventually converge, but suitable parameters (such as the required number of steps) are unknown in advance. Whenever a ground truth is available, the QoR is assessed by, e.g., the ratio of correctly identified pairs of alike nodes over the total number of pairs listed in the ground truth, i.e., the *PPVk* (see Sect. 2.1). Otherwise, one must ensure that the MCMC has converged.

In this chapter, we summarize specific aspects of creating an LA problem solver, give insights into available metrics for measuring the QoR, and propose an appropriate heuristic that can speed up the run time by a factor of 15.4× compared to a conservative approach.

# **2 Link Assessment Based on z\***

Several node similarity measures have been proposed by different scientific communities, such as the *Jaccard index*, the *Pearson correlation coefficient*, or the *hypergeom* [12]. In [24 SPP], we have introduced a new similarity measure, the z\*, which has been shown to be the most robust one across a range of datasets, from protein-protein interactions to movie ratings to social networks.

Given a bipartite graph *G*((*Vl,Vr*)*,E*) with vertices *Vl* and *Vr* and edges *E*, we define *coocc*(*u,v*) as the number of co-occurring neighbors of nodes *u* and *v*. Most node similarity measures inherently depend on this quantity, but differ in how they are normalized based on the structure of the network. The similarity scores between nodes of the side of interest, say *Vl*, are the basis for the one-mode projection *G*((*Vl,Vr*)*,E*) ⇒ *G*′(*Vl*,*E*′), where *E*′ are edges between nodes in *Vl*. Some similarity measures use a simple factor based on properties of the two nodes *u* and *v* (e.g., Jaccard index), while others are the result of a comparison to the expected value from a null-model (e.g., hypergeom).
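A minimal sketch of the co-occurrence computation (our illustration; the side of interest is given as adjacency sets):

```python
from itertools import combinations

def cooccurrences(adj_left):
    """coocc(u, v) for all pairs of left-side nodes of a bipartite graph."""
    return {(u, v): len(adj_left[u] & adj_left[v])
            for u, v in combinations(sorted(adj_left), 2)}
```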

The z\* falls into the second category, as a combination of the p-value and the z-score statistics of the node-pairwise co-occurrences. Node pairs are ranked more similar if their p-value is smaller, and ties are broken by their z-score [24 SPP]. Of key importance is the null-model used, the *fixed degree sequence model (FDSM)*. The FDSM is a random graph model that preserves the degree sequence of the original network while randomizing its nodes' interconnections, or edges. While it has been shown that the FDSM is a superior null-model compared to simpler graph models [13,14,27], closed-form expressions for the expected co-occurrences, *cooccFDSM*(*u,v*), are not known. These quantities are instead estimated by a random sampling procedure, known as a Markov chain Monte Carlo (MCMC) approach. Algorithm 1 describes the complete calculation of the z\*.
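
As a rough illustration of how the p-value and z-score are combined into a ranking, consider the sketch below. The MCMC sampling of FDSM graphs (Algorithm 1) is abstracted away here, and the sign convention and tie-breaking direction are our own assumptions.

```
import statistics

def z_star_ranking(coocc_obs, coocc_samples):
    """coocc_obs: {(u, v): observed co-occurrence};
    coocc_samples: list of {(u, v): co-occurrence in one FDSM sample}.
    Returns node pairs sorted from most to least similar."""
    scores = {}
    n = len(coocc_samples)
    for pair, obs in coocc_obs.items():
        samples = [s[pair] for s in coocc_samples]
        # Empirical p-value: fraction of samples with at least the observed coocc.
        p_value = sum(1 for x in samples if x >= obs) / n
        # z-score of the observation w.r.t. the sampled distribution (assumed sign).
        sd = statistics.pstdev(samples) or 1e-9
        z = (obs - statistics.mean(samples)) / sd
        scores[pair] = (p_value, -z)   # smaller p-value first, larger z breaks ties
    return sorted(scores, key=scores.get)
```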

#### **2.1 Ground Truth and *PPVk***

Throughout this chapter, we discuss a variety of results for the Link Assessment (LA) using the Netflix Prize dataset<sup>1</sup> as an example. By setting a threshold, the data are represented as a bipartite graph between users and movies, where an edge (*u,v*) means that user *u* liked (4 or 5 stars on the 1–5 scale) movie *v*. By finding significant co-occurrences between any two movies (*v,w*), a one-mode projection can be obtained [27]. The projection to the movies side was preferred because the users are anonymized, so it would be impossible to generate a ground truth of known similar users.

We quantify the quality of the LA by the positive predictive value (*PPVk*) based on a ground truth dataset that contains only pairs of known non-random association, namely movie sequels like Star Wars and James Bond. The *PPVk* is the fraction of correctly identified pairs from the ground truth in the set of the *k* highest ranked pairs of movies, where *k* is the number of pairs in the ground truth (see [3 SPP] for an example).
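
A minimal sketch of the *PPVk* computation, assuming the ranking is given as a list of node pairs ordered from most to least similar and the ground truth as a set of pairs (the helper names are ours):

```
def ppv_k(ranking, ground_truth):
    """Fraction of the top-k ranked pairs that appear in the ground truth,
    where k is the size of the ground truth."""
    k = len(ground_truth)
    normalize = lambda pair: tuple(sorted(pair))   # treat (u, v) and (v, u) alike
    truth = {normalize(p) for p in ground_truth}
    hits = sum(1 for pair in ranking[:k] if normalize(pair) in truth)
    return hits / k

# E.g., with sequels known to be related:
# ppv_k(z_star_ranking(...), {("Star Wars IV", "Star Wars V"), ...})
```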

Building a ground truth for real datasets requires orthogonal information about the data (information that is not available to the LA method being tested) as well as a

<sup>1</sup> Available at https://www.kaggle.com/netflix-inc/netflix-prize-data.

**Algorithm 1:** The complete Link Assessment algorithm, calculating the similarity measure z\*

**Data**: Graph *G*((*Vl*, *Vr*); *E*) with vertices *Vl* and *Vr* and edges *E*, *Vl* being the vertices of interest;
**Result**: A z\*-score (p-value and z-score) for all pairs of vertices (*u, v*) ∈ (*Vl* × *Vl*);

```
 1  Calculate coocc(u, v) ∀ (u, v) ∈ (Vl × Vl);  G0 := G;
 2  for i := 1 to |samples| do
 3      Gi := Gi−1;
 4      // Graph randomization:
 5      for |swaps| do
 6          Choose two edges at random in Gi and swap them,
            if no duplicate edge arises from the swap;
 7      // Coocc computation:
 8      Calculate coocci(u, v) ∀ (u, v) ∈ (Vl × Vl);
 9  // Calculate p-value and z-score, i.e., the z*:
10  p-value(u, v) := |{i ∈ 1..|samples| : coocci(u, v) ≥ coocc(u, v)}| / |samples|   ∀ (u, v) ∈ (Vl × Vl);
11  cooccFDSM(u, v) := {coocci(u, v) : i ∈ 1..|samples|}   ∀ (u, v) ∈ (Vl × Vl);
12  z-score(u, v) := (mean(cooccFDSM(u, v)) − coocc(u, v)) / stddev(cooccFDSM(u, v))   ∀ (u, v) ∈ (Vl × Vl);
```
reliable method, such as assuming that movies within a sequel are non-randomly similar. Therefore, reliable ground truths are rare, limiting the range of input datasets for which the *PPVk* can be measured.

Recently, however, we have discovered a systematic way of generating synthetic graphs for which the ground truth can be directly extracted, based on the benchmarks proposed in [14]. With that, we are able to conduct experiments for reliably comparing the efficiency and QoR of several LA approaches over an arbitrary range of input datasets. However, this work is still ongoing.

#### **2.2 Random Graph Models**

*Network motifs* are subgraphs whose occurrence in the observed data is statistically significant when compared to a random graph model (a null-model). The choice of such a null-model must be well-suited to test the investigator's hypothesis, and an inappropriate null-model can result in misinterpretation of the observed data [8]. The fixed degree sequence model (FDSM) is considered most appropriate for the identification of motifs, and in many cases, only simple graphs should be considered, i.e., no self-loops or multi-edges. Unfortunately, closed-form expressions for the expected motif frequency over all possible simple graphs with a prescribed degree sequence are not yet known. Therefore, we commonly rely on a comparatively inefficient MCMC approach based on sequential mixing of the sampled graph states.

In [23 SPP] and [22 SPP], we have investigated different null-models that can be generated more efficiently as approximations for the FDSM, and have developed equations with the same intention. While some 3-node subgraph frequencies can be well approximated by simple equations, the case of the node-pairwise co-occurrences is more complex. For very regular degree sequences, i.e., when all nodes have similar degrees, an equation based on the simple independence model is sufficient to estimate the individual node pairs' co-occurrences. As the degree sequence becomes more skewed, the true values from the FDSM diverge from the approximation. Even a more intricate approximation for the individual co-occurrences [19, p. 441], whose sum almost matches the true value, becomes inaccurate for high-degree nodes. Since skewed degree sequences are abundant in real networks, such approximations cannot be widely used.
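
A simple independence-model estimate of this kind can be written down in a few lines. The concrete formula below (expected coocc ≈ deg(*u*) · deg(*v*)/|*Vr*| for a bipartite graph whose left nodes attach their edges independently and uniformly) is our reading of such a model, not the exact equation from [23 SPP, 22 SPP]:

```
def expected_coocc_independence(deg_u, deg_v, n_right):
    """Expected co-occurrence of u and v if each attaches its edges
    independently and uniformly to the n_right nodes of the other side."""
    return deg_u * deg_v / n_right

# Reasonable for near-regular degree sequences; diverges from the FDSM
# as the degree sequence becomes more skewed (see text).
```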

#### **2.3 Co-occurrence Gradient in the FDSM**

In another attempt to avoid the costly MCMC sampling approach for estimating the expected co-occurrences in the FDSM, we have analyzed the co-occurrence gradients throughout the Markov chains, a so-called *mean-field approach* borrowed from statistical physics. In this approach, we first find the differential equation that describes the expected change, i.e., gradient, in co-occurrence after one single step in the graph mixing Markov chain. If the gradient is sufficient to describe the dynamics of the chain, a closed-form solution for the expected co-occurrences could be derived (see [1] for an example of a successful attempt), or at least an iterative, direct method that is not based on sampling.

In order to fully describe the dynamics of the mixing chains, the co-occurrence gradient, Δ**coocc** = Δ**coocc**(**coocc**), must be a function of only the co-occurrence matrix, **coocc**. If additional parameters are needed, their dynamics must also be represented in differential equations. As it turned out, however, the co-occurrence gradient can only be found if the structure of the graph is taken into account, i.e., **coocc** is not sufficient. Table 1 exemplifies such insufficiency by showing that a centrosymmetric **coocc** matrix (middle) does not result in an equally centrosymmetric Δ**coocc** matrix (right).

**Table 1.** An example of an adjacency matrix (left) whose row-pairwise *co-occurrence (coocc)* matrix (middle) is not sufficient to calculate the expected *coocc* gradient (right). For the sake of clarity, the gradient values w.r.t. the edge switching chain (right) are shown without normalization by the number of possible swap trials per step, |*E*|² = 112.


In fact, the expected change in co-occurrence, a node pairwise relation, can only be given by the interaction between the neighborhoods of three nodes. This becomes clear once we realize that *coocc*(*i, j*) can only be changed if the neighborhood of a third node *k* is modified since the degree sequences are fixed. In turn, the dynamics of the node 3-wise relations can only be described by 4-wise relations, and so on. Therefore, we conclude that a full description of the dynamics of the mixing chains w.r.t. the pairwise co-occurrences is not feasible.

# **3 The Benchmarking Problem**

In general, comparing different system implementations is a non-trivial task. The reason is that plenty of parameters influence the final system behavior, such as the underlying system architecture, the selected algorithms, the chosen software implementation language, compilers, or communication and memory infrastructures. Besides, the performance of a system can heavily depend on the input data, in particular if adaptive ("*self-tuning*") methods are used. Thus, fairly comparing implementations requires an in-depth analysis of the relevant factors and the target application domains first.

In this context, we distinguish between the *application* or *problem* (the actual task to be carried out), the employed *model* or *algorithm* and its final *implementation* on a specific *architecture*. The latter three make up the final *system solution* that we are evaluating.

Let us look more closely at an example: We define the *application* or *problem* as "recommend movies to a client who has already watched several other movies", a generic task, e.g., in a video streaming service. The choice of an appropriate *algorithm* for this task is crucial for the overall system behavior: We can, e.g., select a graph-based approach as discussed in this chapter, statistical analysis, or machine learning based methods [15]. Each of those can be implemented in pure software on a generic computer architecture, in hardware, or in a hybrid hardware/software setting that combines programmable architectures such as central processing units (CPUs) with hardware accelerators. The selection of an appropriate underlying system *architecture* is strongly linked to the chosen algorithm, since there may be strong interactions between the two. Some algorithms are better suited to implementation in hardware or accelerators (in particular if they allow highly parallel processing), while others fit better to programmable (i.e., in general sequential or control-driven) architectures. Thus, fixing *algorithm* and *architecture* in the system design flow is an iterative and heavily interdependent process that requires a deep understanding of both domains and of the target *application*, since the latter may impose additional restrictions or constraints on the other two. In particular, low-level parameters such as selecting appropriate data structures for efficient memory accesses or custom data types with reduced precision can lead to strong increases in performance and energy efficiency, but may also impact the QoR (see Chapter 4).

However, evaluating or comparing *systems* always requires well-defined *metrics*. In order to allow comparisons across architectural borders, those metrics need to be independent of the underlying system architecture and/or employed software. "Operations per second", for example, is still a widespread metric in the high-performance computing (HPC) domain, but cannot be applied to systems that incorporate hardware accelerators (mainly data-flow architectures with hard-wired circuits) in which no "operations" exist. Thus, we propose application-level metrics that are not tied to the selected algorithms or architectures. Examples are "run-time for a specific task", "consumed energy for a run", and "achieved QoR for a specific task".

In the above-mentioned example "recommend movies to a client who has already watched several other movies", we could, e.g., compare a software implementation running on a CPU-based cluster with a (hybrid) hardware-accelerated architecture. After deciding that we are going to implement a graph-based approach over other available options, we can still select the specific algorithm for the LA part (e.g., Edge Switching (ES) or Curveball (CB), see Sect. 4), the number of processing elements (PEs), the amount of hardware acceleration (if any), the memory hierarchy, the communication infrastructure, the data structure (e.g., matrix vs. adjacency list), and the data types (e.g., floating-point precision) that impact both storage demands and required computational effort. It is obvious that this large number of degrees of freedom leads to an overwhelming number of possible *system solutions* that all solve the same task, but with different characteristics.

While "run-time for a specific task" and "consumed energy for a run" can be measured or estimated with rather straight-forward approaches, determining the "achieved QoR for a specific task" is much harder to quantify. The reason is that in general multiple ways for measuring quality exist that must be investigated more specifically in order to determine which metric provides the most meaningful insights for the specified application.

In addition, stochastic parts of the selected algorithm (e.g., based on former system states, random numbers, or early stopping criteria/heuristics) may even lead to variations over different runs with the same input data. Thus, a robust quality metric not only needs to provide a meaningful quantitative statement of the achieved QoR but should also be determined in a way that minimizes stochastic impacts on the result. In the LA part of our example ("recommendation system"), we could, e.g., use the *PPVk* [24 SPP] as a direct measure of the results, or autocorrelation [6 SPP] or perturbation [25] as a QoR measure for the mixing itself.

A generic approach for tackling these issues is the use of *benchmark sets* that try to cover specific application areas with typical data points. Most of them consist of so-called *batteries* that combine multiple tests into larger task lists to minimize set-up/initialization and read-out overhead and to reduce stochastic effects.
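
As a sketch of what such an application-level benchmark battery could look like in practice (the task tuple layout, solver interface, and QoR callback are hypothetical; the QoR could, e.g., be the *PPVk* from Sect. 2.1):

```
import time

def run_battery(tasks, solver, qor):
    """tasks: list of (name, dataset, ground_truth); solver: returns a ranking;
    qor: application-level quality measure taking (ranking, ground_truth)."""
    results = []
    for name, dataset, truth in tasks:
        start = time.perf_counter()
        ranking = solver(dataset)
        runtime = time.perf_counter() - start
        results.append({"task": name, "runtime_s": runtime,
                        "qor": qor(ranking, truth)})
    return results
```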

In order to stop a stochastic process when a sufficient QoR is achieved, we employ *heuristics* that perform online tracking of specific QoR measures together with desired target values (so-called *early stopping criteria*). Once the target is achieved, the processing is stopped. For the Link Assessment (LA) problem with the ES chain, we have analyzed how the *PPVk* changes over the number of samples and swaps throughout the processing [4 SPP]. We are using two data sets, the Netflix competition data set and a medium-size MovieLens data set<sup>2</sup>. More detailed insights are given in Fig. 5.

Figure 4 and Fig. 5 clearly show that the *PPVk* saturates abruptly once a specific number of samples or swaps is reached (a so-called *phase transition*). From this moment on, further processing does not increase the QoR any more. Thus, we can stop when we detect the phase transition of the *PPVk* and use this as an early stopping criterion for this task.

<sup>2</sup> The 100k MovieLens data set, available from http://grouplens.org/datasets/movielens/.

From this criterion, we can derive an appropriate heuristic that we incorporate into the final implementation. One crucial aspect for such a heuristic is its *stability*, i.e., it must be ensured that it works reliably for the allowed range of input data sets for a given application and that it stops the processing at the earliest possible time when the desired QoR is achieved. We present appropriate heuristics for the LA in Sect. 5.1.

# **4 Edge Switching vs. Curveball**

Generating random samples from the fixed degree sequence model (FDSM) remains the most accurate method for estimating the expected co-occurrences between nodes, and therefore also for performing the Link Assessment (LA). Exact sampling schemes, where random graph samples are generated from scratch and exactly uniformly at random, have been proposed, but their computational complexity is *O*(*n*³) [9]. Most commonly, the random graphs are generated by sequentially mixing the original graph's edges, specifically using the Edge Switching (ES) Markov chain. Strona et al. [25] proposed a new algorithm, coined the Curveball (CB), which instead of switching a pair of edges randomly trades the neighborhoods of two nodes. The CB was quickly proven to converge to the uniform distribution and adapted for different types of graphs [5] (see Chapter 2 for more details about the Curveball algorithm).

The mixing time of a Markov chain refers to the number of steps in the chain required to reach any possible state with equal probability<sup>3</sup>, i.e., to disassociate the final, random state from the initial state [21]. While the true mixing time is known for neither Markov chain, ES or CB, first empirical results suggested that the CB is a more efficient method of randomizing a graph [5,25]. These results are based on the perturbation score and are discussed in terms of the number of steps in the respective Markov chains. In practice, however, one CB step may take much longer than one ES step, so an actual runtime comparison between implementations is more meaningful.

In this section, we show the runtime comparison between two versions of the CB algorithm and an ES implementation. The Sorted-lists Curveball (SCB) iterates through two randomly selected nodes' neighborhoods (lists) in order to find, shuffle, and reassign the disjoint set of neighbors. Although finding the disjoint set is facilitated by keeping sorted lists, after the re-assignment the lists must be re-sorted in preparation for the next trade, so the overall complexity is *O*(*degmax* × log *degmax*) per trade, where *degmax* is the maximum node degree of the network. The Hashed-lists Curveball (HCB) avoids sorting the lists by creating a temporary hash-map of each neighborhood. The complexity of the HCB depends on the properties of the hash-map used, but in general lies between *O*(*degmax*) and *O*(*degmax*²) per trade. Finally, the ES uses two redundant data structures to achieve a complexity of *O*(1) per swap: the adjacency lists to randomly pick two existing edges, and the adjacency matrix to check whether they can be swapped.
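
These data-structure trade-offs can be sketched as follows. This is a simplified, in-memory Python rendering of one ES swap trial and one curveball trade, not the optimized implementations compared below; in particular, a Python set stands in for the bit adjacency matrix that provides the O(1) duplicate check.

```
import random

def edge_switch_step(edges, edge_set):
    """One ES swap trial: edges is a list of (u, w) with u in Vl, w in Vr;
    edge_set mirrors it as a set for O(1) duplicate checks."""
    i, j = random.randrange(len(edges)), random.randrange(len(edges))
    (u, w), (v, x) = edges[i], edges[j]
    if (u, x) in edge_set or (v, w) in edge_set:
        return False                      # swap would create a duplicate edge
    edge_set.difference_update({(u, w), (v, x)})
    edge_set.update({(u, x), (v, w)})
    edges[i], edges[j] = (u, x), (v, w)
    return True

def curveball_trade(adj, u, v):
    """One CB trade: shuffle the disjoint parts of the neighborhoods of u and v."""
    common = adj[u] & adj[v]
    disjoint = list((adj[u] | adj[v]) - common)
    random.shuffle(disjoint)
    k = len(adj[u]) - len(common)         # node degrees are preserved
    adj[u] = common | set(disjoint[:k])
    adj[v] = common | set(disjoint[k:])
```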

#### **4.1 Perturbation Score**

The perturbation score [25] is the number of different entries between the adjacency matrices of the original graph and the random sample. Since each step in the edge

<sup>3</sup> Within an arbitrary error margin.

switching chain can only swap two edges, at most four entries of the adjacency matrix are modified in each step. The perturbation score, although perhaps not a true estimator of the total mixing time, is a direct measure of the shortest-path distance between two states of the ES chain.
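
A minimal sketch of the perturbation score, here computed from edge sets rather than explicit adjacency matrices (an equivalent count, since each differing matrix entry corresponds to an edge present in exactly one of the two graphs):

```
def perturbation_score(edges_original, edges_sample):
    """Number of adjacency-matrix entries that differ between the two graphs,
    i.e., the size of the symmetric difference of their edge sets."""
    return len(set(edges_original) ^ set(edges_sample))

def relative_perturbation(score, max_score):
    """Normalization used in Fig. 1: score divided by its maximum observed value."""
    return score / max_score
```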

The first comparison between the randomization algorithms was conducted using the Netflix Prize dataset. Figure 1 shows the relative perturbation<sup>4</sup> vs. the runtime for a curveball implementation (the SCB) and the ES implementation (shown are averages of at least 10 repetitions). The bottom-most subplot refers to the complete dataset, while the first two refer to random subsets of 10000 and 100000 users, respectively. Since the curveball algorithm is sensitive to the number of adjacency lists (i.e., the number of columns in the adjacency matrix), both Movies × Users and Users × Movies representations are simulated. In this analysis, irrespective of the amount of data and the number of adjacency lists, the ES implementation is at least 2× faster than the SCB.

**Fig. 1.** Relative perturbation achieved by ES and SCB vs runtime for different subsets of the Netflix Prize dataset. Machine: Intel(R) Xeon(R) E5 2640v3 @ 2.60 GHz.

Besides the surprising results showing that the ES implementation using both the adjacency lists and matrix is faster than the CB based on sorted adjacency lists (further discussed in Sect. 4.2), another interesting effect can be seen in Fig. 1: Mixing the

<sup>4</sup> We define the relative perturbation as the perturbation score normalized by its maximum value among all mixing chains and repetition runs.

Users side of the bipartite network is always faster than mixing the Movies side, irrespective of the number of users and movies. An explanation for that may be that the degree distributions of the two types of nodes, users and movies, do not have the same shape. Figure 2 shows the degree distribution densities for user and movie nodes for a subset of 100000 randomly selected users of the Netflix Prize dataset. While relatively more movies have very low or very high degrees, user degrees concentrate in the middle, a trend that holds for any subset of randomly selected users (not shown). Given that a curveball trade consists of shuffling the disjoint neighborhoods of two randomly selected nodes, a higher mixing efficiency is expected when the two nodes have similar degrees. Therefore, we conclude that, when mixing bipartite graphs using the curveball Markov chain, it is advantageous to perform the trades between nodes from the side with the less skewed degree distribution, at least with respect to the perturbation score vs. the runtime. Note that the ratio between the numbers of nodes on the two sides does not play a major role w.r.t. the runtime (see Fig. 1), contrary to the belief of the original curveball algorithm inventors [25].

**Fig. 2.** The degree distribution densities of users and movies for a subset of the Netflix Prize dataset containing good ratings from 100000 random users. Users' degrees are more concentrated in the middle while movies' degrees have a more skewed distribution, with many very low and many very high degree movies.

#### **4.2 Runtime Comparison with NetworKit**

In [6 SPP], an I/O-efficient implementation of the curveball trades for simple, unipartite graphs is proposed. Its key feature is the introduction of a trade sequence that is lexicographically sorted before the curveball trades are performed, resulting in a more efficient memory access (see Chapter 2 or [6 SPP] for more details). In cooperation with the Group of Algorithm Engineering from the Goethe University Frankfurt, we compare our ES, SCB, and HCB implementations to a bipartite version of their curveball.

Figure 3 shows the runtime comparison results between our ES, SCB, and HCB and the NetworKit CB [6 SPP] implementations. Each mixing chain executes 20 super

**Fig. 3.** Runtime comparison of the ES and three curveball implementations. The numbers over the colored bars indicate the average runtime in milliseconds for 10 super steps (see text). The black, thin bars indicate the standard deviation of 10 runs. NetworKit CB is consistently faster (at least 2x faster) and scales better than our ES, HCB, and SCB implementations. Machine: Intel(R) Xeon(R) CPU E5-2630v3 @ 2.40 GHz. (Color figure online)

steps, i.e., 10 × #*Users* for the curveball and 10 × #*Edges*<sup>5</sup> for the edge switching, which considers, in expectation, the state of each edge 20 times; this is when both ES and CB converge in quality according to the autocorrelation thinning factor [6 SPP].

Figure 3 shows that NetworKit CB is at least 2× faster than our CB implementations, as well as faster than the ES. From Fig. 1 we see that our ES is between 1.5× and 2.5× faster than the SCB, which in turn is between 2× and 3× slower than NetworKit CB. Therefore it is safe to assume that there is a CB implementation that is at least as fast as our ES even according to the perturbation<sup>6</sup>. Furthermore, it becomes evident from Fig. 3 that the CB, both our SCB and NetworKit's, scales consistently with the size of the graph, while the ES appears to scale with higher constant factors.

With these results, we conclude that the Curveball algorithm can indeed be efficiently implemented in software, and that it scales better than the ES. However, it also became clear that the interpretation of the results can differ strongly depending on which quality metric is considered (perturbation or autocorrelation), and it is not clear which one is the most relevant.

# **5 Phase Transition and Heuristics**

Instead of a steady increase in the quality of the LA with the underlying MCMC parameters, mainly the mixing length and number of samples, we actually see a flat low

<sup>5</sup> Notice the factor 0.5× between the super step definition in [6 SPP] and here. This is due to the representation of unipartite, undirected graphs in [6 SPP] requiring duplicated edges, one in each direction, while bipartite graphs do not.

<sup>6</sup> Unfortunately, a direct comparison w.r.t. the perturbation turned out to be difficult because of incompatibilities between the structures of the source codes.

quality followed by a sudden and steep increase. This phase transition-like behavior was first reported in [4 SPP], further discussed in [3 SPP], and is shown in Fig. 4. The LA quality, measured by the *PPVk*, is plotted against the number of samples, the main parameter w.r.t. the total runtime of the method.

**Fig. 4.** Link assessment quality (*PPVk*) over number of samples. For a wide variety of data sets, narrow phase transitions are present. Similar phase transitions are also seen for the number of edge swaps (see Fig. 7 in [3 SPP]). (Color figure online)

**Fig. 5.** Quality over number of swaps. For a wide variety of data sets, narrow phase transitions are present, similar to what can be observed when varying the number of samples (see Fig. 2 in [4 SPP]). (Color figure online)

The complete Netflix dataset (blue), e.g., requires 384 samples to reach a *PPVk* of 0.4206 ± 0.0019, while 16,384 samples only increase this value to 0.4217 ± 0.0012. While, on the one hand, this indicates that a low number of samples is sufficient, the steep transition also cautions us against taking too few samples, since 64 samples instead of 384 would result in roughly half the quality (0.20 ± 0.03) and 48 samples (0.001 ± 0.001) are no better than a random guess. Therefore, this is not mainly a trade-off problem, where more resources (samples) bring better quality, but rather a threshold problem, at which the quality transitions steeply from its minimum to its maximum. Finding this threshold can be done via online heuristics, as discussed in Sect. 5.1.

#### **5.1 Heuristics for MCMC Parameters**

Continuing the sampling procedure after the maximum LA quality has been reached results in increased runtime for no benefit. Therefore, being able to reliably assess when maximum quality has been reached can yield large speedups for the complete link assessment method. For example, compared to the commonly used 10,000 samples, we see in Fig. 4 that, for the complete Netflix dataset, the LA reaches maximum quality at around 384 samples, which would represent a speedup of > 25×.

Figure 4 also shows that the tipping point depends on the dataset. However, we were not able to find correlations between the input dataset and the required number of samples, and so an analytic formula could not be derived.

Even though the mechanism that causes the phase transitions is not yet completely understood, we know that it must be related to the stability of the ranking of the most significant pairs of nodes. We have tested multiple methods to evaluate the stability of the final result by comparing it with the previous one [4 SPP, 3 SPP]. A good estimator for the final quality should also present a phase transition-like behavior, as does the quality itself. In [3 SPP] we described in detail our successive attempts to find a reliable heuristic, going from the number of matching pairs at the very top of the ranking, over more sophisticated correlation methods, to finally the internal *PPVk* method. The internal *PPVk* method consists of creating an internal ground truth based on the top ranked node pairs at each iteration and using it to calculate the *PPVk* after a new (group of) sample(s) is drawn. This measure turned out to be stable and well correlated with the observed phase transition of the actual quality, which is based on the real ground truth.
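
A simplified sketch of the internal *PPVk* idea: after each new batch of samples, the top-ranked pairs of the previous iteration serve as an internal ground truth, and sampling stops once the internal *PPVk* exceeds a threshold. The parameter names mirror those of Table 2 (α, *k'*, *samplesstep*), but the control flow below is our own condensation of [3 SPP, 4 SPP], not the exact implementation.

```
def sample_until_stable(draw_samples, current_ranking, alpha=0.95,
                        k_int=100, samples_step=16, max_samples=10000):
    """draw_samples(n): draws n more FDSM samples and updates the ranking;
    current_ranking(): returns node pairs ordered from most to least similar."""
    drawn = 0
    prev_top = set()
    while drawn < max_samples:
        draw_samples(samples_step)
        drawn += samples_step
        top = set(current_ranking()[:k_int])
        # Internal PPVk: overlap of the new top-k' with the previous one.
        internal_ppv = len(top & prev_top) / k_int
        if internal_ppv >= alpha:
            break                      # phase transition passed; ranking is stable
        prev_top = top
    return drawn
```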

A rather similar heuristic, in the sense that it also exhibits a phase transition-like behavior, was devised for assessing the required amount of mixing (w.r.t. the number of edge swaps) [3 SPP]. It is based on the fact that the average *cooccFDSM*(*a,b*) of two nodes *a*, *b* only depends on their degrees. Therefore, we expect to see a converging behavior of multiple node pairs that have the same degree pair if the amount of mixing is sufficient (see Fig. 10 in [3 SPP]).
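
The swaps heuristic can be sketched along the same lines: node pairs are grouped by their degree pair, and mixing is considered sufficient once the running average coocc values within each group have converged towards each other (θ*min*, *Ng*, *Np* as in Table 2). This is a schematic rendering of the idea, not the exact criterion of [3 SPP].

```
from statistics import mean

def mixing_converged(avg_coocc, groups, theta_min=0.05):
    """avg_coocc: {(u, v): running average coocc over the samples so far};
    groups: N_g lists of node pairs sharing the same degree pair (N_p pairs each).
    Converged if, in every group, the pairwise averages deviate from the
    group mean by less than theta_min (relative)."""
    for group in groups:
        values = [avg_coocc[pair] for pair in group]
        center = mean(values)
        if center and max(abs(v - center) / center for v in values) > theta_min:
            return False
    return True
```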

Table 2 summarizes the speedups achieved by applying each of the two heuristics for the number of swaps and samples presented in [4 SPP, 3 SPP], individually and in combination. In all cases, the heuristics accelerate the overall LA (all overheads are accounted for) when compared to the safe number of swaps (|*E*| × log|*E*|) and samples (10000) without any significant degradation of its quality. The largest graph seems to benefit the most from either and both heuristics. For the smallest graph, the MovieLens data, the overhead of the swaps heuristic becomes significant, overshadowing the increased randomization speed during the actual sampling. Also, it still requires around one-third of the 10000 samples for the internal *PPVk* to converge.

Notice that we can clearly see an interdependence between the number of swaps and the number of samples. For example, using a more conservative set of parameters for the swaps heuristic results in 8 × 10⁷ edge swaps and 454 samples for the complete Netflix data, almost ten times more swaps but four times fewer samples than the less conservative alternative. This is expected since the relevance (or entropy) of each new sample is

**Table 2.** Runtime and quality comparisons of the LA with and without heuristics [3 SPP]. The swaps heuristic has three parameters: the convergence threshold, θ*min*; the number of groups of node pairs with the same degree pair, *Ng*; and the number of node pairs per group, *Np*. The samples heuristic also takes three parameters: the internal *PPVk* threshold, α; the size of the internal ground truth, *k'*; and the number of samples between evaluations, *samplesstep*.


directly related to the amount of mixing between samples: the more independent the samples, the more information each of them adds to the pool. In this case, the more conservative choice is slightly faster.

The trade-off between the amount of mixing between samples and the required total number of samples provides an effective measure of the quality of each random sample. As the quality measure is independent of the mixing chains, we have used it to compare the effectiveness of different chains and their runtimes. An online heuristic that optimizes this trade-off is also under investigation.

#### **5.2 Phase Transitions as Mixing Quality Estimation**

Earlier results from the implementation of heuristics to find appropriate parameters for the MCMC sampling showed that there exists a trade-off between the amount of mixing

**Fig. 6.** The phase transitions in the link assessment quality of the Netflix 10k dataset for increasing mixing lengths of a) Edge Switching, b) Curveball in Movies, and c) Curveball in Users. The vertical line-widths represent the standard deviation of 10 independent simulations. Mixing lengths are given in super steps, where 1 super step (Sect. 4.2) is a) |*E*|/2, b) |*M*|/2, and c) |*U*|/2.

and the number of samples required for convergence (Table 2). This insight led us to investigate how the *PPVk* phase transitions behave when we vary the mixing length of the chains.

Figure 6 shows the LA quality phase transitions w.r.t. the number of samples for increasing mixing chain lengths and three different mixing chains. As expected, the phase transitions are shifted to the right (more samples are required) as the mixing length decreases. If the mixing length is sufficient, however, increasing it further may not significantly change the result. This behavior can be explained by the correlation (or level of independence) between consecutive samples, which is expected to decrease exponentially with the amount of mixing. If enough mixing is performed, the sampling procedure becomes virtually as efficient as it can be, as if each sample were drawn uniformly at random. If the correlation between consecutive samples is measurable, the sampling efficiency is degraded, and therefore the phase transitions both start later and become less steep.

Figure 6 also shows that there can be a great difference in the efficiency of the Curveball algorithm depending on whether we choose to mix (b) movies or (c) users, even if the mixing lengths are normalized w.r.t. the adjacency matrix dimensions. Similar behavior was already observed in Fig. 1, where mixing the users' neighbors was faster in reaching the maximum perturbation. Surprisingly, however, mixing the movies side instead of the users is more effective when the LA quality is regarded. To better expose this discrepancy, we can determine the number of samples required to reach a certain *PPVk* threshold, say 0.3 in this case. If the perturbation were a good predictor of the sampling efficiency, we would expect the number of samples to be lowest only when the maximum perturbation is reached.

Figure 7 shows the perturbation and the number of samples required to reach the threshold *PPVk* of 0.3 versus the amount of mixing. For the edge switching (ES) and the Curveball in users (CB\_C), there seems to exist a strong (inverse) correlation between the perturbation and the number of samples. When the Curveball mixes the movies neighborhoods (CB\_R), however, the perturbation is far too conservative. While the

**Fig. 7.** The perturbation (left axis) and the number of samples until the *PPVk* reaches 0.3 (right axis) vs the mixing length for the Netflix 10k dataset. The error bars represent the standard deviation of 10 independent simulations. Mixing chains: Curveball in Users (CB\_C); Curveball in Movies (CB\_R); Edge Switching (ES).

minimum number of samples of around 1200 is reached at 2.5 super steps, the perturbation only reaches its maximum value at 150 super steps, an overshoot of 60×.

# **6 Summary and Conclusion**

In this chapter, we summarize strategies for increasing the sampling efficiency for the Link Assessment (LA) problem. Although we provide specific results for the latter, our approach is generic and can be applied to related applications. Since in many cases no closed-form solutions exist, stochastic Markov chain Monte Carlo (MCMC) methods need to be employed. However, their main drawbacks are the high computational demand and the lack of reproducibility of exact numerical results due to their stochastic components. In addition, different algorithms may be used for generating the MCMC samples (edge switching vs. curveball), and their performance on different compute architectures can vary strongly. In many cases, it is not clear which algorithm is the best one for a specific data set on a specific architecture with respect to performance or quality.

Thus, we highlight the importance of application-level benchmark sets together with application-level, numerical measures for the quality of results (QoR) of specific runs. Such benchmarks allow a fair and quantitative comparison via metrics that are not tied to a particular architecture, such as "energy per run", "time per run", or "quality of the results". In addition, for many real-world data sets, we do not have any kind of ground truth that could serve as a test oracle when evaluating the achieved quality. In order to overcome this issue, we propose to artificially construct meaningful benchmark batteries (together with ground truths) for specific application domains that cover the main tasks and the corner cases equally.

Furthermore, we consider the *PPVk* to be a good quality measure for the LA problem. However, since it is strongly linked to the ground truth, which is not available in many cases, we have investigated alternative metrics such as autocorrelation or perturbation. Those also allow the construction of unsupervised systems that are able to determine a "sufficient" QoR on their own. We clearly observe correlations between the latter metrics and the *PPVk* in Fig. 7, but a comprehensive analysis of any formal causalities between them is still ongoing.

Finally, we propose a working heuristic for an LA solver based on edge switching that exploits observable phase transitions of the *PPVk* as an early stopping criterion for the MCMC process. This heuristic provides speedups of up to 15.4× compared to solvers that use conservative approaches.

# **References**

- 5. Carstens, C.J., Berger, A., Strona, G.: Curveball: a new generation of sampling algorithms for graphs with fixed degree sequence. CoRR abs/1609.05137 (2016). https://doi.org/10.1016/j.mex.2018.06.018
- 7. Duranton, M., et al.: HiPEAC vision 2019. In: European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC) (2019)
- 8. Fosdick, B.K., Larremore, D.B., Nishimura, J., Ugander, J.: Configuring random graph models with fixed degree sequences. SIAM Rev. **60**(2), 315–355 (2018). https://doi.org/10.1137/16M1087175
- 9. Genio, C.I.D., Kim, H., Toroczkai, Z., Bassler, K.E.: Efficient and exact sampling of simple graphs with given arbitrary degree sequence. CoRR abs/1002.2975 (2010)
- 10. Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. ACM Trans. Knowl. Discov. Data **1**(3), 14 (2007). https://doi.org/10.1145/1297332.1297338
- 11. Goh, K.I., Cusick, M.E., Valle, D., Childs, B., Vidal, M., Barabási, A.L.: The human disease network. Proc. Natl. Acad. Sci. **104**(21), 8685–8690 (2007). https://doi.org/10.1073/pnas.0701361104
- 12. Goldberg, D.S., Roth, F.P.: Assessing experimentally derived interactions in a small world. Proc. Natl. Acad. Sci. **100**(8), 4372–4376 (2003). https://doi.org/10.1073/pnas.0735871100
- 13. Gotelli, N.J.: Null model analysis of species co-occurrence patterns. Ecology **81**(9), 2606–2621 (2000). https://doi.org/10.1890/0012-9658(2000)081[2606:NMAOSC]2.0.CO;2
- 14. Gotelli, N.J., Ulrich, W.: The empirical Bayes approach as a tool to identify nonrandom species associations. Oecologia **162**(2), 463–477 (2010). https://doi.org/10.1007/s00442-009-1474-y
- 25. Strona, G., Nappo, D., Boccacci, F., Fattorini, S., San-Miguel-Ayanz, J.: A fast and unbiased procedure to randomize ecological binary matrices with fixed row and column totals. Nat. Commun. **5**(1), 4114 (2014). https://doi.org/10.1038/ncomms5114
- 26. Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: ICCV, pp. 843–852. IEEE Computer Society (2017). https://doi.org/10.1109/ICCV.2017.97
- 27. Zweig, K.A., Kaufmann, M.: A systematic approach to the one-mode projection of bipartite graphs. Soc. Netw. Anal. Min. **1**(3), 187–218 (2011). https://doi.org/10.1007/s13278-011-0021-0

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# A Custom Hardware Architecture for the Link Assessment Problem

André Chinazzo(B) , Christian De Schryver, Katharina Zweig, and Norbert Wehn

TU Kaiserslautern, Kaiserslautern, Germany {chinazzo,schryver,wehn}@eit.uni-kl.de, zweig@cs.uni-kl.de

Abstract. Heterogeneous accelerator enhanced computing architectures are a common solution in embedded computing, mainly due to the constraints in energy and power efficiency. Such accelerator enhanced systems dispatch data- and computing-intensive tasks to specialized, optimized and thus efficient hardware units, leaving most control flow tasks for the more generic but less efficient central processing units (CPUs). Nowadays, also high-performance computing (HPC) systems are becoming more heterogeneous by incorporating accelerators into the computing nodes.

In this chapter, we introduce the concept of heterogeneous computing and present the design of a hardware accelerator for solving the Link Assessment (LA) problem, in introduced Chapter 3. The hardware accelerator integrates its main dedicated processing units with a customized cache design and light-weight data path. We provide detailed area, energy, and timing results for a 28 nm application specific integrated circuit (ASIC) process and DDR3 memory devices. Compared to an CPUbased cluster, our proposed solution uses 38x less memory and is 1030<sup>x</sup> more energy efficient for processing a users-movies dataset with half a million edges.

Keywords: Link assessment · Application specific · Custom hardware · DRAM

# 1 Introduction

In this chapter, we introduce the concept of heterogeneous computing and present the design of a hardware accelerator for solving the Link Assessment (LA) problem introduced in Chapter 3. The hardware accelerator integrates its main dedicated processing units with a customized cache design and a light-weight data path. We provide detailed area, energy, and timing results for a 28 nm application specific integrated circuit (ASIC) process and DDR3 memory devices. Compared to a CPU-based cluster, our proposed solution uses 38x less memory and is 1030x more energy efficient for processing a users-movies dataset with half a million edges.

Keywords: Link assessment · Application specific · Custom hardware · DRAM

# 1 Introduction

Nowadays, we live in the era of the so-called *data deluge*, i.e., the growth in produced data outpaces the progress in the available compute performance. This poses heavy challenges on data-centric (statistical) methods, algorithms, and compute systems [18]. Among others, selecting the appropriate data structures, heterogeneity, and parallelization schemes is crucial for achieving high computing performance with low energy demands. For example, central processing unit (CPU)-based systems can only access data stored in memory as complete words (cache lines) and work with fixed data types. In contrast, dedicated hardware accelerators allow custom bit widths and data types. This not only saves energy by avoiding unnecessary data transfers and operations but also allows direct bit-wise operations like, e.g., accessing one-bit-column entries in a matrix.

In general, standard computing architectures based on CPUs and graphics processing units (GPUs) move data around heavily. However, in modern technologies, data transfers and storage in general consume much more power than the actual computing [5]. In particular, accessing (off-chip) dynamic random-access memory (DRAM) is a very time- and energy-consuming task. This leads to the concept of so-called *data-driven* or *dataflow computing*, e.g., employed in the Google TensorFlow architecture [5]. Such architectures focus on the data stream and manipulate data on-the-fly, avoiding unnecessary storage and data transfers.

In addition, in data centers, servers alone only consume around one-third of the total power, while the rest is required for cooling, communication, storage, and building supply [8]. Seen from a different perspective, the maximum available power budget of a system (or a data center) is a hard limit for the available computing power. The latter can only be increased by installing compute systems with a higher power efficiency (e.g., incorporating special hardware accelerators, for instance with a dataflow architecture). Thus, reducing the power demand of the compute servers in combination with the smart reduction of inter-server communication can lead to a total of 2-3x power savings in the data center itself.

Modern systems on chip (SoCs) in the mobile, embedded, and Internet-of-Things (IoT) domain are heavily heterogeneous systems with plenty of custom components for dedicated purposes such as audio decoding, video en- and decoding, radio transmission, or sensor data pre-processing in a mobile phone. In particular for mobile devices, there are hard limits for both energy (battery capacity) and power (maximum heat dissipation). However, over the last decades we have seen more and more heterogeneity also in the data centers [1,5]. Examples are general purpose graphics processing units (GPGPUs), the Intel Xeon Phi accelerator cards, or the field programmable gate array (FPGA)-based Amazon EC2 F1 instances released in 2017<sup>1</sup>. One of the major reasons is the so-called *Dark Silicon* phenomenon: In modern chip technologies, only a small number of transistors can be active at a time in order to avoid overheating (and thus destruction) of the device [7]. This also poses a heavy challenge for the classical multi-core approach: more cores of the same type do not provide more computation power if they cannot all be powered up at the same time.

Nevertheless, end-users are not at all interested in the underlying technology of the *services* they use. Nowadays, most services are distributed over an information technology (IT)-infrastructure from IoT nodes, mobiles, edge servers, and data centers [13]. Thus, the overall application is partitioned and disseminated on various parts of the IT-infrastructure, all with probably different computing architectures and characteristics. As an example, consider a real-time navigation service from Google or Apple: The Global Positioning System (GPS) coordinates collected by (maybe external) GPS receivers are sent to the SoC of the mobile that acts as a human-machine interface (HMI), displaying the route. However, the route itself is calculated in a data center of the service provider. In addition, GPS data from other service users is employed for estimating traveling times and traffic jams, and incorporated in the route calculation.

<sup>1</sup> See https://aws.amazon.com/ec2/instance-types/f1/. Last accessed on 24/11/2022.

In this chapter, we give an overview of hardware-assisted compute systems for applications based on the *Link Assessment (LA)* algorithm. The LA algorithm can be used to clean up large network data sets with noisy data. It assesses the structural similarities between the nodes, and thus differentiates meaningful relationships between nodes from noisy ones [19 SPP]. The LA algorithm as presented in Chapter 3 can be employed in a wide range of applications, e.g., recommendation systems, protein-protein interaction analyses in biology, or business analytics and marketing [3 SPP].

In Sect. 2 we give a short overview about the fundamentals of hardware (HW) and hardware/software (HW/SW) design both for custom application specific integrated circuit (ASIC) and FPGA architectures. Section 3 provides detailed insights in our proposed HW architecture for the Link Assessment (LA) algorithm. Performance data and comparisons are given in Sect. 4. Section 5 concludes this chapter.

# 2 Basics of Hardware and Systems Design

Custom, dedicated hardware compute architectures are substantially different from standard programmable architectures such as CPUs or GPUs. They are tailored for a specific task, avoiding all unnecessary overhead in storing/moving data, for control architectures, and over-precision data types. This increases both compute performance and power/energy efficiency, at the cost of low to zero flexibility after design. In contrast to a program written for CPUs, hardware architectures, in general, do not receive and execute instructions. Instead, their behavior is encoded in the circuit itself.

Hardware accelerators are electrical (abstracted: digital) circuits that focus on data manipulation. They can be realized in three ways:

- as discrete circuits built from individual components on a printed circuit board (PCB),
- as application specific integrated circuits (ASICs), i.e., fixed geometries manufactured in silicon, or
- on programmable logic devices (PLDs) such as FPGAs, which are configured after production (see Sect. 2.2).
Nowadays, most systems are realized on a so-called *system on chip (SoC)*. In contrast to discrete circuits realized on PCBs, a SoC combines most components on a single piece of silicon. For that purpose, various *processing elements (PEs)* are attached to a communication infrastructure (a bus or a *network on chip (NoC)*). In addition, external input/output (I/O) interfaces are provided for receiving data from and sending data to the outside world. An example for such a SoC structure is given in Fig. 1.

In general, not all PEs are developed by the system designer (team) on their own. Instead, many component architectures are available for purchase as so-called *intellectual property (IP)*, i.e., as hardware geometry or as design data given in a hardware description language (HDL) or a logical netlist. They mostly ship with an equivalent software model that can be used for behavioral analysis, testing, and debugging purposes. IP cores can, to some extent, be compared to software libraries in programming since they offer predefined functionalities that can be incorporated into the overall system. However, most IP cores are closed-source and only available on a commercial basis. In contrast to software projects, open-source hardware platforms such as *opencores.org* are very limited, both in their available contents and their technology.

Fig. 1. Example for a SoC with processing elements, interconnect, and interfaces (By en:User:Cburnett - Own work in Inkscape based on en:Image:ARMSoCBlockDiagram.gif, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2866881)

#### 2.1 Hardware/Software System Design Flow

The generic (classic) *design flow*<sup>2</sup> for custom computing systems is shown in Fig. 2. It is much more complex than a pure software development flow. The flow starts with a so-called *hardware-software-partitioning* that determines which parts of the overall behavior will be realized in hardware or software. While considering available hardware and software IP in conjunction with functional and non-functional requirements such as throughput, energy/power limitations, or quality aspects, the system (architecture) platform is determined. After a preliminary simulation, the actual implementation of the hardware and software components starts. Finally, the system components, their interaction, and the final system behavior are validated.

Since we expect software development flows to be well-known by the readers of this chapter, we will focus on the hardware development part in the following.

#### 2.2 FPGA Basics

Hardware architectures realized in an *application specific integrated circuit (ASIC)* can no longer be changed after production (they are fixed geometries in silicon). In contrast, a *programmable logic device (PLD)* is shipped as a device with plenty of available hardware units that can be connected after production. This *programming* or *configuration* can be done either once<sup>3</sup> or multiple times. A prominent example for the latter is a *field programmable gate array (FPGA)*.

FPGAs are hardware devices that come with a large amount of flexible small hardware units, so-called lookup tables (LUTs). They are basically very small random access memories (RAMs) that are written during the boot process (*"configuration"*) of the FPGA. In addition, FPGAs provide a complex and flexible interconnect system that is configured together with the LUTs. Furthermore, special components such as Block RAMs (BRAMs), fixed bitwidth multiply-accumulate (MAC) units, multipliers, and I/O components are available.

FPGAs do not have a functional behavior before being initially configured. Some types can even be (partially) re-configured during operation, i.e., changing (parts of) the circuit while the rest of the system continues running. Thus, systems equipped with FPGAs allow a very high level of flexibility and dynamics (however, at the cost of an immensely complex design flow, see Fig. 2). In addition, combined CPU/GPU-FPGA systems are available, both in the high-performance computing (HPC)/data center and the embedded SoC domain.

The acquisition of the FPGA vendors Altera by Intel in 2015 and Xilinx by AMD in 2020 shows the potential of this technology for the future of the computing landscape.

<sup>2</sup> A lot of different elaborate system design flows exist [2,11,17] that are omitted here for the sake of clarity.

<sup>3</sup> One-time programmable devices are physically modified during the programming, e.g., by burning connections or melting so-called *antifuses* that create a conducting connection afterwards.

The proposed hardware architecture for computing the Link Assessment (LA) algorithm can be realized both on ASICs and FPGAs. In the following, we present our architecture in detail and illustrate the differences compared to classical CPU implementations.

Fig. 2. Generic design flow for a SoC (By Traced by User:Stannered - en:Image:SoCDesignFlow.gif, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=1864027)

# 3 Hardware Architectures for the Link Assessment Computation

Many applications in the big data context are based on fast and reliable identification of so-called *network motifs* in large networks, i.e., those subgraphs whose occurrence is significantly higher than expected in a random graph model [15]. This enables analyzing large-scale biological data in bioinformatics, connections in social networks, incident detection, and general graph data cleaning procedures by LA [22 SPP].

Network motif detection is actively investigated in current research, but mainly from the algorithmic point of view. On the implementation side, nearly all available work deals with mapping the motif detection problem onto parallel CPU- and GPU-based clusters [9,14].

For the Link Assessment (LA) algorithm, we consider a special variant of motifs, the so-called *co-occurrence (coocc)*, which is defined as the number of common neighbors between two nodes of a graph. Formally, *coocc*(*u, v*) = |*N*(*u*) ∩ *N*(*v*)| for any pair of nodes *u, v* ∈ *G*, where *N*(*u*) is the neighborhood of node *u* in graph *G*. Throughout this chapter, we use the shorthand "*coocc* matrix of a network/graph" in place of "the set of all node-pairwise *cooccs* of a network/graph", i.e., *coocc*(*G*) = {*coocc*(*u, v*) ∀ *u, v* ∈ *G*}. For a bipartite graph *G* = *G*(*Vl*, *Vr*; *E*), with vertex partitions *Vl* and *Vr* and edges *E* ⊂ (*Vl* × *Vr*), the *coocc* matrix can be defined for either partition, e.g., *coocc*(*Vl*) = {*coocc*(*u, v*) ∀ (*u, v*) ∈ (*Vl* × *Vl*)}, in which case *Vl* is called the side of interest. In this chapter, we focus on bipartite graphs.

The *coocc*(*u, v*) by itself is a way of quantifying the similarity of nodes *u* and *v*. However, it is a strongly biased quantifier, e.g., w.r.t. the degrees of the nodes. The LA algorithm reduces such biases by comparing the observed *coocc* of the real network with its expected value for a random graph model (null-model), namely the fixed degree sequence model (FDSM) [22 SPP, 19 SPP]. As the name suggests, the FDSM is the set of all graph configurations that share the same degree sequence as the observed graph, and it has been shown to provide more robust results than simpler null-models [22 SPP]. Since closed-form solutions for the expected co-occurrences, *cooccFDSM*(*u, v*), are not known, these quantities are estimated by a random sampling procedure, known as a Markov chain Monte Carlo (MCMC) approach.

The MCMC approach is divided into two main steps: (1) the randomization of the graph by repeatedly swapping its edges until an uncorrelated, and hence unbiased, sample of the FDSM is reached, and (2) the computation of the sample's *cooccs*. Of key importance are (a) the number of swap trials between samples and (b) the number of samples drawn from the FDSM. For the interested reader, Chapter 3 presents the LA in more detail, including an in-depth analysis of the effect of those parameters, (a) and (b), on the final quality of the results as well as on the total runtime of the algorithm. In fact, MCMC sampling is the most time-consuming part of the LA algorithm.

Once enough samples have been created and evaluated, the node-pairwise similarities are calculated as the probability of finding, in the FDSM, a *coocc*(*u, v*) greater than or equal to that of the original graph. The higher this probability, the lower the similarity between (*u, v*). The probability is estimated first by the p-value, and ties are broken by the z-score (see Chapter 3).

In Sect. 3.2 we show that the LA performance is strongly bounded by the speed of the random accesses to the main memory. Aiming to reduce the effects of this unavoidable constraint, in 2015 we presented the first dedicated embedded hardware accelerator optimized for this task [4 SPP]. Precisely tailored cache memories and computational units for the *coocc* calculation help reduce the number of random accesses by using a rather naive representation of the graph, which is not optimal for CPUs. This work is the basis for a granted patent [21 SPP].

In a follow-up work [3 SPP], we exploit the granularity of DRAM devices to increase the efficiency of main memory accesses during the random graph creation (the null model). We demonstrate the performance of our design with the Netflix Prize data set<sup>4</sup> and show that a single ASIC instance has a speedup of 5.6x compared to a 10-node Intel cluster while requiring 38x less memory and 1030x less energy.

#### 3.1 Data Structures

The Link Assessment (LA) requires two main pieces of information: the graph itself and the matrices holding the co-occurrences and similarity measures.

The graph is used by both compute kernels, i.e., the edge swapping (see Chapter 2) and the *coocc* calculation. The edge swapping kernel consists of randomly selecting two edges, (*u, w*) and (*v, x*) with *u, v* ∈ *V<sub>l</sub>* and *w, x* ∈ *V<sub>r</sub>*, and swapping their connections to get (*u, x*) and (*v, w*), provided this does not modify the degree sequences of *V<sub>l</sub>* and *V<sub>r</sub>*. For the edge swapping to have constant compute complexity, the data structures must provide direct access to existing edges of the graph (random edge selection) and a constant-time check for the existence of the new, swapped edges (to preserve the degree sequences). While the adjacency list representation of the graph solves the first task, its adjacency matrix solves the second. Using only one of the data structures would drastically slow down the edge swapping procedure. Therefore, we make use of both graph representations, as formalized next.
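The following is a minimal sketch of a single swap trial under these constraints, assuming both representations are kept in sync: an edge list for random edge selection and an edge set standing in for the adjacency matrix to provide the constant-time existence check. It illustrates the idea only and is not the accelerator's implementation.

```python
import random

def swap_trial(edge_list, edge_set, rng=random):
    """Try one edge swap; return True if the swap was applied."""
    i, j = rng.randrange(len(edge_list)), rng.randrange(len(edge_list))
    (u, w), (v, x) = edge_list[i], edge_list[j]
    # The swapped edges must not already exist, otherwise the swap would
    # create a multi-edge and change the degree sequences.
    if (u, x) in edge_set or (v, w) in edge_set:
        return False
    edge_set.difference_update({(u, w), (v, x)})
    edge_set.update({(u, x), (v, w)})
    edge_list[i], edge_list[j] = (u, x), (v, w)
    return True

edges = [(0, 'a'), (0, 'b'), (1, 'b'), (1, 'c')]
swap_trial(edges, set(edges), random.Random(1))
```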

Given a bipartite graph *G*(*V<sub>l</sub>, V<sub>r</sub>*; *E*) consisting of the vertex partitions *V<sub>l</sub>* and *V<sub>r</sub>* and the edges *E* ⊂ (*V<sub>l</sub>* × *V<sub>r</sub>*), an adjacency matrix *A* of dimensions |*V<sub>l</sub>*| × |*V<sub>r</sub>*| is stored. An entry in the matrix is *A<sub>u,w</sub>* = 1 if (*u, w*) ∈ *E*, with nodes *u* ∈ *V<sub>l</sub>*, *w* ∈ *V<sub>r</sub>*. It is sufficient to store *A* with one bit per entry and a total storage requirement of |*V<sub>l</sub>*| · |*V<sub>r</sub>*| bits. The adjacency list representation is simply the list of all edges *E*, requiring |*E*| (log<sub>2</sub> |*V<sub>l</sub>*| + log<sub>2</sub> |*V<sub>r</sub>*|) bits.

One *coocc* half-matrix is necessary for storing the real graph *cooccs*. It is a half-matrix since *coocc*(*u, v*) = *coocc*(*v, u*), and each pair of nodes (*u, v*) ∈ (*V<sub>l</sub>* × *V<sub>l</sub>*) must be evaluated. A second and identical structure is necessary for

<sup>4</sup> Available at https://www.kaggle.com/netflix-inc/netflix-prize-data. Last accessed on 24/11/2022.

storing the *cooccs* of each random graph sample. Instead of keeping as many *coocc* half-matrices as the number of samples, the similarity measures, p-value and z-score, are updated after each sample. For the p-values, a single half-matrix is required. For updating the z-score, it is sufficient to keep the sum and the sum-of-the-squares of the samples' *cooccs*.

A summary of the memory footprint of each data structure is shown in Table 1.


Table 1. Memory footprint of the data structures for the LA
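Since the table's entries are not reproduced here, the following back-of-the-envelope sketch simply evaluates the storage formulas given above for a graph of the size used later in Sect. 4; the 32-bit width assumed for the *coocc* and statistics counters is an illustrative choice, not the value used in the actual design.

```python
import math

def footprint_bits(n_left, n_right, n_edges, counter_bits=32):
    adj_matrix = n_left * n_right                              # 1 bit per entry
    adj_list = n_edges * (math.ceil(math.log2(n_left)) +
                          math.ceil(math.log2(n_right)))
    half_matrix = n_left * (n_left - 1) // 2                   # coocc(u,v) = coocc(v,u)
    coocc_half = half_matrix * counter_bits
    stats = 3 * half_matrix * counter_bits   # p-value counts, sum, sum of squares
    return {'adjacency matrix': adj_matrix, 'adjacency list': adj_list,
            'coocc half-matrix': coocc_half, 'similarity statistics': stats}

for name, bits in footprint_bits(17_769, 478_615, 56_919_190).items():
    print(f'{name}: {bits / 8 / 2**30:.2f} GiB')
```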

#### 3.2 Memory Boundedness

In order to demonstrate the memory boundedness of the LA, we use the roofline model [20] to profile a parallel, optimized CPU implementation of the algorithm. The roofline model is a visualization tool intended to evaluate the efficiency of computation kernels w.r.t. the underlying hardware. The maximum performance of the hardware is bounded, of course, by its maximum number-crunching speed, but also by the memory access bandwidth. These bounds are represented by the black lines (the rooflines) in Fig. 3. The performance of a computing kernel is measured in operations per second, i.e., how busy the processor really is. Only integer operations (INTOP) are considered because the LA does not use floating-point numbers, and the G in GINTOP stands for Giga, i.e., billions of integer operations. The arithmetic intensity is defined as the ratio of the number of operations to the total memory traffic and is measured in operations per byte.
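The roofline bound itself is easy to state in code. The sketch below uses purely illustrative peak-compute and bandwidth numbers (they are not the measured values of the machine in Fig. 3) and only serves to show why a kernel with low arithmetic intensity ends up far below the compute roof.

```python
def attainable_gintops(arithmetic_intensity, peak_gintops=100.0,
                       bandwidth_gbytes_per_s=50.0):
    """Roofline: attainable performance = min(compute roof, bandwidth * intensity)."""
    return min(peak_gintops, bandwidth_gbytes_per_s * arithmetic_intensity)

# Low arithmetic intensity (operations per byte) puts a kernel on the
# memory-bound part of the roofline.
for intensity in (0.05, 0.5, 5.0):
    print(intensity, attainable_gintops(intensity))
```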

The performance and arithmetic intensity of the edge swapping and the *coocc* computation kernels were measured by Intel Advisor<sup>5</sup>. They are presented in Fig. 3. We can see that the performance of the kernels is 1.3 GINTOP/s for edge swapping and 2.2 GINTOP/s for the *coocc* calculation. This is far from the value attainable by the CPU (109 GINTOP/s). The reason is the low arithmetic intensity of both kernels, as is expected from their tasks. The edge swapping kernel, for example, needs to access multiple random memory locations only to check for the existence of an edge, hence many bytes are accessed but

<sup>5</sup> https://software.intel.com/content/www/us/en/develop/articles/intel-advisorroofline.html.

very little processing happens. Most of the time, this kernel is simply waiting for the data to be loaded, which we call a memory stall. During the stall, no processing occurs.

The impact of the stalls on the total runtime is given per memory hierarchy level. We can see that more than half of the total runtime is spent waiting for the DRAM. Moreover, the DRAM stalls account for almost 80% of the edge swapping runtime. This is expected from the intrinsically random memory access pattern of the edge swapping, which means that the cached data is hardly ever used.

Fig. 3. Roofline analysis of the main compute kernels for the Link Assessment: Edge swapping and co-occurrence computation. Both kernels are strongly memory bounded, with 79% and 54% of the runtime spent in DRAM stalls for the edge swapping and *coocc* kernels, respectively. Machine: Intel Xeon E5-2640 v3 (16 cores at 2.6 GHz) with 2 × 32 GB DRAM.

#### 3.3 Co-occurrence Calculation

Calculating the node-pairwise *coocc* of a given graph is the most time-consuming part of the LA. Using the adjacency matrix, we iterate through each pair of rows (nodes in *V<sub>l</sub>*) and count the number of columns (nodes in *V<sub>r</sub>*) where both elements are 1, i.e., both edges exist. The computational complexity of this procedure is, therefore, *O*(|*V<sub>l</sub>*|<sup>2</sup> · |*V<sub>r</sub>*|).

Through the adjacency list, the complexity can be amortized to *O*(∑<sub>*w* ∈ *V<sub>r</sub>*</sub> *deg*(*w*)<sup>2</sup>), where *deg*(*w*) is the degree of node *w* ∈ *V<sub>r</sub>*. This particularly benefits networks whose degrees follow a power-law distribution, as is the case of most real networks [22 SPP]. For a CPU implementation of the LA, the adjacency list approach is preferred, even though the memory access pattern is unstructured (see Sect. 3.2).
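A hedged sketch of the adjacency-list variant is given below: every node *w* on the right side contributes one common neighbor to each pair of its left-side neighbors, which is where the amortized bound comes from. The dictionary-based half-matrix is our own simplification.

```python
from collections import defaultdict
from itertools import combinations

def coocc_from_adjacency_list(edges):
    right_neighbors = defaultdict(list)       # w in V_r -> its neighbors in V_l
    for u, w in edges:
        right_neighbors[w].append(u)
    coocc = defaultdict(int)                   # sparse half-matrix
    for neigh in right_neighbors.values():     # O(sum_w deg(w)^2) pair updates
        for u, v in combinations(sorted(neigh), 2):
            coocc[(u, v)] += 1
    return coocc

print(dict(coocc_from_adjacency_list([(0, 'a'), (0, 'b'), (1, 'b'), (1, 'c')])))
# -> {(0, 1): 1}
```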

From a hardware architecture design perspective, however, the adjacency matrix approach can be easily implemented with blocks of bit-wise *AND*s followed by an adder tree, which we call a *coocc* module. Due to the small size of such an operational block, it can be replicated multiple times, reaching a degree of parallelism that is not feasible on CPUs. Making use of such high parallelism without being constrained by the DRAM bandwidth requires a well-designed cache layout.

Calculating the *coocc* between all pairs of vertices in *V<sub>l</sub>* in a naive way requires loading the same data many times. For example, calculating the *coocc* between *u, v* ∈ *V<sub>l</sub>* requires the edges connected to *u* and *v*, or in other words the two rows *u* and *v* of the matrix *A*. When the *coocc* is later calculated between *u* and *w*, the same row *A<sub>u</sub>* needs to be loaded again. This leaves huge potential for an optimized memory hierarchy and algorithms that minimize data transfer.

We presented an appropriate solution for this issue in 2015 [4 SPP]: The key idea was to add a row-cache to the *coocc* module. The row-cache must be able to store one complete row of the adjacency matrix.

Having *k* parallel *coocc* units, we use their caches to store a consecutive block of *k* rows *A<sub>u</sub>*, ..., *A<sub>u+k−1</sub>*. Then we stream one by one all following rows through the *coocc* modules, starting with *A<sub>u+k</sub>*. With each new row *A<sub>v</sub>*, the modules can calculate the *coocc* of all pairs of the cached rows (*u, v*), ..., (*u* + *k* − 1, *v*). Algorithm 1 formalizes this scheme.



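Algorithm 1 itself is not reproduced here; the following is a hedged software sketch of the blocked streaming scheme described above, assuming the adjacency matrix is a dense 0/1 NumPy array with the side of interest as rows. For the sketch to be self-contained, pairs inside a cached block are handled in the same loop.

```python
import numpy as np

def blocked_coocc(A, k):
    """Compute the upper-triangular coocc matrix with blocks of k cached rows."""
    n = A.shape[0]
    coocc = np.zeros((n, n), dtype=np.int64)
    for u in range(0, n, k):                     # "cache" rows A[u .. u+k-1]
        block_end = min(u + k, n)
        for v in range(u + 1, n):                # stream the remaining rows
            upper = min(v, block_end)            # only pairs (i, v) with i < v
            # one bit-wise AND plus popcount per cached row, as in the modules
            coocc[u:upper, v] = (A[u:upper] & A[v]).sum(axis=1)
    return coocc

A = np.array([[1, 1, 0, 1],
              [0, 1, 1, 1],
              [1, 0, 1, 0]], dtype=np.uint8)
print(blocked_coocc(A, k=2))
```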
The main advantage of this scheme is that it solves the scaling problem. While adding *m* times more modules reduces the runtime by a factor of *m*, it does not increase the requirements for external bandwidth, since only one row has to be streamed through all the blocks at any given time. This allows us to place hundreds, if not thousands, of *coocc* units next to each other, providing massive speedups.

Figure 4(a) shows the data path tailored to this task, consisting of an adder tree and an accumulator. Each edge cache has a capacity of 64 kB, targeting a frequency of 400 MHz. For a 64-bit double data rate (DDR) channel at 800 MHz, we get 256 edges per cycle when running the *coocc* units at 400 MHz. That means the adder tree has a width of 128 adders at the top and a depth of seven stages. Four *coocc* modules are synthesized in a single cell and combined in a grid of 5 × 12, for a total of 240 *coocc* modules. To distribute the data to the caches or to stream further rows of the matrix, a tree-like replication network is used,

Fig. 4. The *coocc* and result module (b) works on one dataset after another, always updating the same result. It loads one row of the graph into the caches (local memory (LMEM)) and first calculates the *coocc* before calculating the similarity measures. The *coocc* module (a) consists of an efficient adder tree operating on blocks of *l* edges per cycle. The similarity-measure logic, lower half in (b), consists of several arithmetic blocks and is only invoked once per row, making it possible to share most of the resources.

Fig. 5. ASIC layout in 28 nm technology. It consists of 240 *coocc* modules, three DRAM controllers (green), and IO logic. The swap randomization block is not visible here due to its small size. (Color figure online)

while for the results a shift register over the whole chip is used. That makes the architecture perfectly scalable.

In total, this architecture accumulates the *coocc* from 240 × 256 = 61,440 matrix columns per cycle, or ∼24.5 × 10<sup>12</sup> columns per second. Compared with the fastest CPU-based population count [16] running at 3.4 GHz, that represents a speedup of ∼59×.

The rest of the design is occupied by memory controllers and IO, see Fig. 5. For the memory controllers, we have estimated the numbers based on the corresponding publications [6,10]. The whole ASIC has a size of 51.2 mm<sup>2</sup> and an average power consumption of 11.7 W.

Partial-Line Cache Optimization. In a follow-up work [3 SPP], we further increased the efficiency of the hardware architecture by introducing the concept of partial-line caches. Since the area of the *coocc* modules is dominated by their caches, reducing the cache size enables much higher degrees of parallelism. However, if the *coocc* modules cannot hold an entire row of the adjacency matrix, the partial results must be temporarily stored, raising the question of the optimal cache size for achieving the best performance.

As will be detailed in Sect. 3.4, higher granularity DRAM channels (shorter word-sizes) can be used to accelerate the graph randomization step. However, they increase the latency of accessing the adjacency matrix rows and therefore present the worst case for the *coocc* computation.

In Fig. 6 we compare, by simulation, the time it takes to process the adjacency matrix with the time it takes to store the partial result, on average for one line segment, when using a channel word-size of 8 bits (the smallest possible). Since those operations are pipelined, the optimal cache size is given by the Pareto front between the two operations. The smallest latency is reached for a cache size of 8 kB.

Fig. 6. Latencies of accessing the input data stored in adjacency matrix rows in comparison with the latency for storing partial *coocc* results, assuming the same channel width for both memories involved in the design. The Pareto front is the maximum of the two curves. Numbers are for 8-bit channel DRAMs.

While the first design used 240 *coocc* modules with 64 kB caches, 8 kB caches allow us to increase the number of modules up to 1920 for the same total cache size. This results in a similar total chip area, from 51.2 mm<sup>2</sup> to 57.3 mm<sup>2</sup>, as the caches dominate the *coocc* module in both cases. With this approach, we could further reduce the runtime of the *coocc* computation by a factor of 8× when using the same 64-bit channels, or maintain the same speed when using 8-bit channels.

#### 3.4 Swap Randomization

With the accelerated *coocc* computation, the generation of each sample, i.e., the randomization of the graph, becomes the bottleneck. Edge swapping is a strictly sequential operation in that any swap can depend on the result of the previous swap; therefore, its parallelization is not as straightforward as instantiating more processing units. Nevertheless, we addressed this bottleneck by exploiting fine-grained access to DRAM [3 SPP], which is only possible when implementing our own memory controller, as well as a collision-aware swap parallelization.

Fine-Grained DRAM Access. Most modern CPUs have a fixed-size 64-bit interface to DRAM. DRAM devices, however, can have higher granularity interfaces of, e.g., 8 bits (×8), and they are physically combined into groups of 8 devices to build the 64-bit interface. A fixed burst length of 8 DRAM accesses fills up one cache line of 512 bits, or 64 bytes. For any modern CPU, one cache line, i.e., 512 bits, is the minimum amount of data that can be loaded from DRAM.

Since the swap randomization operates only on single integers and single bits, reducing the word length of the DRAM interface increases the "computations per loaded bit" (the arithmetic intensity, see Fig. 3) immensely. Indirectly, of course, it also increases the performance because the swap randomization is bounded by the random memory access latency.

We have derived an alternative hardware architecture that slightly modifies the memory controller in order to address each of the DRAM devices (with 8-bit interfaces) independently [3 SPP]. Normally, the memory controller addresses all 8 DRAM devices of a memory channel as if they were a single device, i.e., it sends the same commands and addresses to all devices. This allows the memory channel to share the command and address lines for all devices, saving energy and area at the cost of having a common address space. The data lines (8 or 64 per ×8 or ×64 device), on the other hand, cannot be shared, as the data in each DRAM device must be transferred independently. By introducing a *chip select signal* and interleaving the commands to each DRAM device, we can transform the common address space into 8 independent ones. This works because, during the DRAM latency (data request to data ready), the DRAM device ignores the address and command lines, as they get internally saved at the request moment. That way, we can load only 8 × 8 = 64 bits instead of 8 × 64 = 512 bits in one DRAM device access. This is a slight modification in the memory controller and channel, but one that could not be accomplished without custom hardware design.

For that scheme to be most efficient, the data stored in each DRAM device must be independent. That is, each DRAM chip holds its own copy of the graph, as shown in Fig. 7(b). With that, we can read or write 8 random numbers with the ×8 channels in the same time a single ×64 channel needs for one, as shown in Fig. 7(c)(d). This scheme speeds up the swap randomization by a factor of 4× up to 8×.

Figure 7(a) shows the alternative architecture using two ×64 memory channels. This design is more suitable whenever the *coocc* calculation is the bottleneck of the algorithm, while the design in Fig. 7(b) provides faster graph randomization. This trade-off is depicted in Sect. 4.

Fig. 7. The ASIC for two memory configurations: ×64 (a) and ×8 (b) channels. In case (a) only two graphs are stored and one swap unit is active, while in case (b) 23 graphs are stored and 22 swap units are active. Architecture (a) is useful for a small number of swaps, while architecture (b) is useful for a high number of swaps. (c) and (d) show how the random reads are performed for a ×64 channel (c) and ×8 channels (d). By interleaving the random accesses of 8 swap units with chip select over one command and address channel, 8 reads can be performed in (d) in the same time as one read in (c). This results in an 8× speedup.

Collision-Aware Swap Parallelization. Edge swapping is an inherently sequential operation in that every step can depend on the previous ones. For large graphs with millions of edges, we access the memory at random locations for billions of chained swaps. Even so, we can divide the edge swapping chain into chunks that can be processed in parallel, provided that none of the swaps depends on a previous one in the same chunk. These chunks can be reordered by the memory controller in order to ensure the minimum number of random accesses.
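A simple greedy version of this collision check is sketched below: a chunk is closed as soon as a new swap trial touches a node already used in the current chunk (or the chunk is full). This is only meant to illustrate the idea; the chunking and reordering in the actual design are performed by the memory controller.

```python
def chunk_swaps(swap_trials, chunk_size):
    """Greedily split a sequence of swap trials into collision-free chunks."""
    chunks, current, touched = [], [], set()
    for (u, w), (v, x) in swap_trials:           # each trial names two edges
        nodes = {u, w, v, x}
        if len(current) == chunk_size or nodes & touched:
            chunks.append(current)
            current, touched = [], set()
        current.append(((u, w), (v, x)))
        touched |= nodes
    if current:
        chunks.append(current)
    return chunks

trials = [((0, 'a'), (1, 'b')), ((2, 'c'), (3, 'd')), ((0, 'a'), (4, 'e'))]
print([len(c) for c in chunk_swaps(trials, chunk_size=12)])  # -> [2, 1]
```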

We have simulated the performance of the swap parallelization for different chunk sizes with the DRAMSys tool [12]. For that, we created trace files that describe the access pattern to the DRAM. The speedup saturates at 2.5× for a chunk size of *N* = 12 parallel swaps. Since *N* is small, checking for collisions between swaps is much faster than writing the swapped edges back to DRAM; therefore, it does not incur any time overhead.


Table 2. Cluster ASIC Comparison

<sup>a</sup> node including: ASIC with 1920 *coocc* modules, 28 nm; 48 GB DDR3 memory (×64 or ×8 channels); board (ethernet, clocks), power supply.

<sup>b</sup> each node: 2× Intel Xeon X5680 @ 12 × 3.33 GHz, 32 nm; 48 GB DDR3 memory.

<sup>c</sup> node including: ASIC with 240 *coocc* modules, 28 nm; 8 GB DDR3 memory (×64 channel); board (ethernet, clocks), power supply.

# 4 Performance Comparison

For demonstrating the performance of our design, we have calculated the similarity measures for the Netflix Prize data set<sup>6</sup>, specifically the good ratings (4 or 5 stars) from users to movies. The resulting graph has 17,769 movies, 478,615 users, and 56,919,190 edges. In this case, *V<sub>l</sub>* are the movies and *V<sub>r</sub>* the users.

In practice, the number of swaps in the randomization process is chosen between |*nodes*| ln |*nodes*| = 6,259,639 and |*edges*| ln |*edges*| = 1,016,414,121. To demonstrate that our design qualifies for the full range, we compare it for both of those extremes. The exact number in practice usually depends on the

<sup>6</sup> Available at https://www.kaggle.com/netflix-inc/netflix-prize-data. Last accessed on 24/11/2022.

nature of the graph. A heuristic for determining the optimal number of swaps is discussed in Chapter 3.

Table 2 compares our ASIC and our optimized cluster implementations of the LA algorithm. The cluster implementation was developed specifically for this reference work and tested on two Intel Xeon X5680 @ 12 × 3.33 GHz, 32 nm server nodes. Optimization involved the selection of an algorithm that minimizes computing time for the given memory resources, removing locks by data partitioning, and data access linearization [4 SPP,3 SPP].

Our first ASIC design (240 *coocc* modules) has a runtime performance comparable to the cluster implementation if a low number of swaps is necessary. Notice, however, that it becomes almost useless (taking 20 days to complete) if |*edges*| ln |*edges*| swaps are required. This is not surprising, since in this first architecture we only focused on accelerating the *coocc* calculation. Still, the total energy consumption is 10x lower (notice that the total energy takes the total runtime into account). This goes to show the amount of energy overhead of software implementations, or how much energy can be saved by task-specific ASICs. This conclusion is interesting for both ends of the computing spectrum: embedded computing systems, which are limited by battery capacity, size, and power constraints, and high-performance computing, which is limited by energy expenses and power dissipation issues.

Our second design shows how reconfigurability can address data-dependent bottlenecks (i.e., the *coocc* or the edge swapping). By using smaller word-sizes (×8 channels), we can accelerate both the *coocc* and the edge swapping in such a way that the Link Assessment becomes 45% faster than the cluster implementation while consuming 360x less energy. When fewer swaps are necessary, the word-size can be increased to ×64 channels, further reducing the *coocc* computation time (the primary bottleneck) and reaching a speedup of 5.6x compared to software. The total energy saving, in this case, is even more impressive: from 114 MJ in software to only 0.11 MJ in the custom design. This is partially due to the large 38x reduction in main memory footprint, from 202 GB to 5.3 GB.

# 5 Conclusion

Further increasing computational performance in modern technologies has become a key challenge for the whole hardware and software industry. Phenomena such as *Dark Silicon* force system designers to move to highly heterogeneous systems, consisting of a large number of highly dedicated hardware accelerators in combination with classical programmable architectures such as CPUs and GPUs. Since hardware accelerators focus on specific tasks, they can be much more power-, energy-, and compute-efficient than the latter.

In this chapter, we present a hardware architecture for the Link Assessment (LA) algorithm, used for cleaning up noisy data in large graphs. Processing and analyzing large graphs will remain a key application in HPC for the next decades. Since the current bottleneck for speeding up this task is fast random access to memory, with standard DRAM architectures and controllers on commodity HPC nodes we experience a hard performance limit, together with high energy consumption.

Our proposed architecture uses custom data structures and exploits bit-wise access to the data in order to overcome these limitations. On a 28 nm ASIC device with a DDR3 controller, it is 1030x more energy efficient than a standard compute cluster, using 38x less memory in total. We show multiple optimization techniques that are specific to custom hardware designs, such as a slight memory controller modification that reduces the average random access latency, and a tailored cache design that enables scalable parallelism w.r.t. memory bandwidth. The architecture is fully flexible and can also be ported as an FPGA accelerator solution. This clearly illustrates the potential of hardware accelerators for the LA in particular and for the graph analysis domain in general.

Transferring the concepts to other algorithms such as Curveball (see Chapter 2) is the subject of ongoing work.

# References

- 5. Duranton, M., et al.: HiPEAC Vision 2019. European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC) (2019)
- 6. Dutoit, D., et al.: A 0.9 pJ/bit, 12.8 GByte/s WideIO memory interface in a 3D-IC NoC-based MPSoC. In: Symposium, VLSIT, pp. C22–C23. IEEE (2013)
- 7. Esmaeilzadeh, H., Blem, E.R., Amant, R.S., Sankaralingam, K., Burger, D.: Dark silicon and the end of multicore scaling. IEEE Micro 32(3), 122–134 (2012). https://doi.org/10.1109/MM.2012.17
- 8. Garraghan, P., Al-Anii, Y., Summers, J., Thompson, H., Kapur, N., Djemame, K.: A unified model for holistic power usage in cloud datacenter servers. In: UCC, pp. 11–19. ACM (2016). https://doi.org/10.1145/2996890.2996896
- 9. Harish, P., Narayanan, P.J.: Accelerating large graph algorithms on the GPU using CUDA. In: Aluru, S., Parashar, M., Badrinath, R., Prasanna, V.K. (eds.) HiPC 2007. LNCS, vol. 4873, pp. 197–208. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-77220-0_21
- 10. Howard, J., et al.: A 48-core IA-32 message-passing processor with DVFS in 45 nm CMOS. In: ISSCC, pp. 108–109. IEEE (2010). https://doi.org/10.1109/ISSCC.2010.5434077
- 20. Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4), 65–76 (2009). https://doi.org/10.1145/1498765.1498785

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Graph-Based Methods for Rational Drug Design**

Andre Droschinsky<sup>1</sup>, Lina Humbeck<sup>2</sup>, Oliver Koch<sup>3</sup>, Nils M. Kriege<sup>4</sup>, Petra Mutzel<sup>5</sup>(B), and Till Schäfer<sup>5</sup>

<sup>1</sup> Department of Computer Science, TU Dortmund University, Dortmund, Germany andre.droschinsky@tu-dortmund.de <sup>2</sup> Computational Chemistry, Medicinal Chemistry, Boehringer Ingelheim Pharma GmbH & Co. KG, Ingelheim am Rhein, Germany lina.humbeck@boehringer-ingelheim.com, lina.humbeck@tu-dortmund.de <sup>3</sup> Institute of Pharmaceutical and Medicinal Chemistry, and Center for Multiscale Theory and Computation, University of Münster, Münster, Germany oliver.koch@uni-muenster.de <sup>4</sup> Faculty of Computer Science, University of Vienna, Vienna, Austria nils.kriege@univie.ac.at <sup>5</sup> Institute for Computer Science, University of Bonn, Bonn, Germany

petra.mutzel@cs.uni-bonn.de, till.schaefer@uni-bonn.de

**Abstract.** Rational drug design deals with computational methods to accelerate the development of new drugs. Among other tasks, it is necessary to analyze huge databases of small molecules. Since a direct relationship between the structure of these molecules and their effect (e.g., toxicity) can be assumed in many cases, a wide set of methods is based on the modeling of the molecules as graphs with attributes.

Here, we discuss our results concerning *structural* molecular similarity searches and molecular clustering and put them into the wider context of graph similarity search. In particular, we discuss algorithms for computing graph similarity w.r.t. maximum common subgraphs and their extension to domain specific requirements.

**Keywords:** Drug discovery · Cheminformatics · Graph similarity · Molecular similarity · Maximum common subgraph · Maximum similar subgraph · Structural graph set clustering · Subgraph mining · Molecular library · BRD4

# **1 Introduction**

The era of big data has reached academic and industrial pharmaceutical drug research in the last decade, which has changed how drugs are developed. Nowadays, large collections of bioactivity data and large databases of potentially synthesizable molecules exist. Publicly available bioactivity databases like ChEMBL [4] or PubChem [25] contain over 16 million data points about molecules that modulate protein or drug target functions. This allows data-driven decisions via in-depth data mining and knowledge discovery approaches, e.g., the identification of similar molecules for the prediction of a known protein target or unwanted side effects. The extraction of molecular features enables an increasingly reliable prediction of properties such as toxicity or oral availability.

The chemical space of drug-like molecules provides another source of big data. Theoretical analysis of the comprehensive chemical space estimates around 10<sup>62</sup> molecules with a typical drug size. Among those, around 166 billion molecules are described by the chemical universe database GDB-17, which was built up using 17 standard atoms that occur within drugs [47]. The REAL space<sup>1</sup>, a large collection of commercially available chemical compounds, contains about 15.5 billion molecules that are potentially synthesizable. Finally, the current version of the ZINC database [55] contains over 750 million purchasable compounds that have already been synthesized.

Several established tools and workflows are available that utilize bioactivity data or the chemical space for the rational development of bioactive molecules [20 SPP]. These approaches are based on the common basic assumption that similar molecular structures have similar bioactivities. A classical approach for identifying similarity between molecules is to use molecular fingerprints, i.e., fixed-size vectorial representations of structural characteristics, e.g., extended-connectivity fingerprints [46]. Although this form of representation allows fast comparisons and the usage of fast vector-based tools, vectorization suffers from information loss and can lead to inaccurate discrimination of similar molecules. This becomes a problem for the above-described 'big' molecular databases since the available similarity measures do not discriminate enough. Thus, similarity searches in such databases return too many false positives, which hampers further processing.
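As a minimal illustration of fingerprint-based comparison, the sketch below treats fingerprints as sets of hashed structural features and scores them with the Tanimoto (Jaccard) coefficient, a common choice for such bit vectors; the feature sets are made up.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto/Jaccard similarity of two feature sets."""
    return len(fp_a & fp_b) / len(fp_a | fp_b)

fingerprint_a = {1, 4, 7, 9, 12, 15}
fingerprint_b = {1, 4, 7, 9, 12, 21}
print(tanimoto(fingerprint_a, fingerprint_b))  # -> 0.714...
```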

A more accurate comparison of molecules is directly based on the graph representation of the chemical structures. This representation allows the modeling of the molecules as graphs with attributes and the use of graph-theoretic concepts, algorithms and tools to analyze molecular databases. Figure 1 shows two similar molecules and their graph representation. The atoms are modeled as vertices, and the bonds between the atoms as edges. Attributes for the graph could be, e.g., a label for the vertices providing the atom name and a label for the edges encoding the binding type.

Unfortunately, comparing two molecules via their molecular graphs based on the concept of isomorphism is notoriously much more time-consuming than molecular fingerprint-based similarity search. Additionally, a comparison based on the maximum common substructure (maximum common subgraph) between two molecules may fail to identify molecules with similar chemical properties, since the classical definition of a common substructure is too strict under some circumstances. Therefore, novel methods are urgently needed for the analysis of the still increasing amount of molecular data. The focus of the interdisciplinary project "Graph-based Methods for Rational Drug Design" has been the development of new structural approaches w.r.t. molecular similarity search and molecular clustering. This chapter presents some of the main results and puts them into a wider context of graph similarity.

Preliminaries and mathematical definitions are provided in Sect. 2. State-of-the-art methods for comparing graphs w.r.t. the size of their *Maximum Common Subgraph (MCS)* in the context of molecular graphs are discussed in Sect. 3. For drug design,

<sup>1</sup> https://enamine.net/library-synthesis/real-compounds.

**Fig. 1.** Two similar molecules (Sildenafil and Vardenafil) and their corresponding graphs. The colors display the atom types (nodes) and bond types (edges).

it is often advisable to preserve certain molecular substructures –such as rings, blocks, or bridges– in comparisons since they have special biochemical properties as a whole. This method for comparing molecules can be further improved by incorporating chemical knowledge about reasonable atom or substructure substitutions that presumably do not affect bioactivity considerably. Figure 1 shows an example of two drugs with atom substitutions in the bicyclic structure that do not affect bioactivity. In our model of similarity, it is allowed to change certain structures of the graphs and still mark them as *structurally equivalent*. Thus, we have introduced the *Maximum Similar Subgraph (MSS)* problem. Our findings, including algorithms and experimental results, are discussed in Sect. 3.2.

Clustering analysis is used for a variety of tasks in drug discovery. This includes complexity reduction, structure-activity relationship reasoning in visual analytics, novelty analysis of de novo databases (see Sect. 5.3), diversity analysis, structured sampling, and many more. Cluster analysis on huge molecular databases is the topic of Sect. 4. First, we discuss computational and information-theoretic challenges before we present a scalable state-of-the-art structural clustering algorithm (*StruClus*) that tackles these challenges. In Sect. 5, we discuss some selected successful applications in rational drug design in the context of this priority program. In a scaffold-focused analysis of bioactivity data, we discovered an unexpected similarity in ligand binding between two important drug targets (BRD4 and PPARγ) in cancer therapy (cf. Sect. 5.2). This discovery was possible using Scaffold Hunter, an open-source tool developed in our group to support the drug discovery process (cf. Sect. 5.1). In Sect. 5.3, we present CH*I*PMUNK, a new virtual database of more than 95 million synthesizable small molecules. Using *StruClus*, it was possible to demonstrate the novelty of the database in comparison to existing molecular libraries.

**Fig. 2.** Example: A common subgraph *C* of the graphs *G* and *H*. Dashed arrows indicate the subgraph isomorphism.

# **2 Preliminaries**

An *undirected labeled graph G* = (*V, E, l*) consists of a finite set of *vertices V*(*G*) = *V*, a finite set of *edges E*(*G*) = *E*, and a labeling function *l* : *V* ∪ *E* → *L*, where *L* is a finite set of *labels*. An edge {*u, v*} connects two vertices *u, v* ∈ *V*, *u* ≠ *v*. A (simple) *path* of length *n* is a sequence of vertices (*v*<sub>0</sub>, ..., *v<sub>n</sub>*) such that {*v<sub>i</sub>*, *v<sub>i+1</sub>*} ∈ *E* for *i* = 0, ..., *n* − 1 and *v<sub>i</sub>* ≠ *v<sub>j</sub>* for *i* ≠ *j*. A *tree* is a graph in which any two vertices are connected by a unique path. A graph is called *planar* if it admits a drawing on the plane without edge crossings, and it is *outerplanar* if such a drawing is possible in which every vertex lies on the boundary of the outer face.

For our similarity approaches based on subgraph isomorphisms, we need the following definitions. Let *G* and *H* be two undirected labeled graphs. A *(label preserving) subgraph isomorphism* from *G* to *H* is an injection ψ : *V*(*G*) → *V*(*H*) such that ∀*v* ∈ *V*(*G*) : *l*(*v*) = *l*(ψ(*v*)) and ∀*u, v* ∈ *V*(*G*) : {*u, v*} ∈ *E*(*G*) ⇒ {ψ(*u*), ψ(*v*)} ∈ *E*(*H*) ∧ *l*({*u, v*}) = *l*({ψ(*u*), ψ(*v*)}). If there exists a subgraph isomorphism from *G* to *H*, we say *H supports G*, *G* is a *subgraph* of *H*, *H* is a *supergraph* of *G*, or write *G* ⊆ *H*. If additionally {*u, v*} ∈ *E*(*G*) ⇐ {ψ(*u*), ψ(*v*)} ∈ *E*(*H*) for all *u, v* ∈ *V*(*G*), then ψ is an *induced* subgraph isomorphism. If there exists a subgraph isomorphism from *G* to *H* and from *H* to *G*, the two graphs are isomorphic. A *common subgraph* (cf. Fig. 2) of *G* and *H* is a graph *C* that is subgraph isomorphic to *G* and *H*. A *maximum common subgraph* (MCS) is a common subgraph of maximum size (vertices plus edges).
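For very small graphs, the definition of a label-preserving subgraph isomorphism can be checked by brute force, as in the following sketch; a graph is represented here as a triple of vertex list, set of frozenset edges, and a label dictionary over vertices and edges. This is purely illustrative and far removed from the algorithms discussed below.

```python
from itertools import permutations

def is_subgraph(G, H):
    """Brute-force test for a label-preserving subgraph isomorphism from G to H."""
    VG, EG, lG = G
    VH, EH, lH = H
    for image in permutations(VH, len(VG)):      # candidate injections psi
        psi = dict(zip(VG, image))
        if any(lG[v] != lH[psi[v]] for v in VG):
            continue                              # vertex labels must match
        ok = True
        for e in EG:                              # every edge must be preserved
            u, v = tuple(e)
            f = frozenset((psi[u], psi[v]))
            if f not in EH or lG[e] != lH[f]:
                ok = False
                break
        if ok:
            return True
    return False

# An O-C fragment is a subgraph of an O-C-C chain (all edge labels are '-').
C = ([0, 1], {frozenset((0, 1))}, {0: 'O', 1: 'C', frozenset((0, 1)): '-'})
H = ([0, 1, 2], {frozenset((0, 1)), frozenset((1, 2))},
     {0: 'O', 1: 'C', 2: 'C', frozenset((0, 1)): '-', frozenset((1, 2)): '-'})
print(is_subgraph(C, H))  # -> True
```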

A graph *G* = (*V, E*) with |*V*| ≥ 3 is called *biconnected* if *G* \ {*v*} is connected for each *v* ∈ *V*. A maximal biconnected subgraph of a graph *G* is called a *block*. An edge {*u, v*} ∈ *E*(*G*) not contained in any block of *G* is a *bridge*. A vertex *v* of *G* is called a *cutvertex* if *G* \ {*v*} consists of more connected components than *G*. A *BC-tree* BC<sub>*G*</sub> of a graph *G* consists of a node for each block and bridge in *G* and all the cutvertices of *G*. Two nodes (blocks or bridges) *b*, *b*′ in a BC-tree are connected through the path *b c b*′ if they share the cutvertex *c* ∈ *V*(*G*). Figure 3 exemplifies a graph and its BC-tree. Let *S* and *G* be graphs, and ψ : *V*(*S*) → *V*(*G*) be a subgraph isomorphism. Then ψ is *block-and-bridge-preserving* (BBP) if any two edges in different blocks in *S* map to different blocks in *G*, and each bridge in *S* maps to a bridge in *G*.
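The blocks, bridges, and cutvertices that make up a BC-tree can be inspected with off-the-shelf graph libraries; the sketch below assumes networkx is available and uses a made-up example graph similar in spirit to Fig. 3.

```python
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 1),      # a block (triangle)
              (3, 4),                       # a bridge
              (4, 5), (5, 6), (6, 4)])      # another block

blocks = [c for c in nx.biconnected_components(G) if len(c) > 2]
bridges = list(nx.bridges(G))
cutvertices = list(nx.articulation_points(G))
print(blocks)       # e.g. [{1, 2, 3}, {4, 5, 6}]
print(bridges)      # e.g. [(3, 4)]
print(cutvertices)  # e.g. [3, 4]
```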

The *support* supp(*G*, 𝒢) of a graph *G* over a set of graphs 𝒢 is the fraction of graphs in 𝒢 that support *G*. *G* is said to be *frequent* if its support is larger than or equal to a *minimum support threshold supp*<sub>min</sub>. A frequent subgraph *G* is *maximal* if there exists no proper frequent supergraph of *G*.

**Fig. 3.** A connected graph (left side) and its BC-tree (right side). The BC-tree's block nodes are depicted as green squares; the bridge nodes as blue squares. The white filled circles are the cutvertices. The associated subgraphs of *G* are depicted above the blocks and bridges. (Color figure online)

# **3 Molecular Similarity Based on Graphs**

An essential criterion of molecular similarity in drug design is not only the similarity in chemical structure but also the similarity in biological activity or bioactivity. In order to obtain molecular similarities meeting this requirement, we introduce a graph-based method, which addresses the following problem.

**Definition 1.** *Given two molecular graphs G and H, the* maximum similar subgraph problem *is to find chemically meaningful subgraphs of G and H with equivalent bioactivity.*

Starting from this informal description, we introduce clearly defined graph-theoretical problems extending the maximum common subgraph paradigm. Since scalability is a critical concern, algorithmic aspects and complexity results must be taken into account and related to the specific properties of molecular graphs. These graphs are almost always planar and often outerplanar [18]. Since the number of bonds per atom is limited, the vertex degrees are bounded. It can be observed that all the graphs representing small molecules have a small tree width. The *tree width* of a graph essentially measures the similarity of a graph to a tree structure. Trees have tree width 1, and graphs that can be constructed via parallel or serial merges (series-parallel graphs) have tree width 2. Typically, molecular graphs have vertex and edge attributes that are either discrete labels or numerical values.

We proceed with a discussion of similarity approaches based on the maximum common subgraph paradigm and the specific challenges when applied to molecular graphs. Then, new graph-based methods are introduced, which address these challenges as part of the maximum similar subgraph problem.

# **3.1 Challenges and Approaches in Comparing Molecular Graphs**

The *maximum common subgraph problem* is to find a common subgraph of maximum size in two given graphs. In the domain of cheminformatics, the maximum common subgraph problem has been extensively studied [12,44,50]; see [28 SPP] for a recent survey. In this domain, it is often referred to as the maximum or *largest common substructure problem*. This problem is known to be *NP*-hard. With trees as input and output, the problem was shown to be polynomial-time solvable [35], but bioactive molecular graphs are not trees in general. The fact that they are mostly outerplanar does not directly lead to efficient algorithms since the maximum common subgraph problem restricted to outerplanar graphs remains *NP*-hard. Instead of developing maximum common subgraph algorithms for more general graph classes, which has proven difficult, a different approach represents molecules in a simplified way as trees [41]. Then, vertices typically represent groups of atoms, and their comparison requires rating the similarity of two vertices by a weight function. However, similar to fingerprints, this goes along with a loss of information. Especially when comparing to large molecular databases, e.g., to rank the molecules regarding their similarity, this loss of information can lead to a reduced distinctiveness [21 SPP].

For molecular graphs, there is a variation of the maximum common subgraph problem of high practical relevance. There, the block (i.e., connected set of molecular rings) and bridge (i.e., molecular chain) structure of the input graphs must be retained by the common subgraph, i.e., the underlying subgraph isomorphism is *block-and-bridge preserving* (BBP). This variation is denoted the *block-and-bridge preserving maximum common subgraph problem* (BMCS) and requires the common subgraph to be connected and the associated subgraph isomorphisms to be BBP. There is a variant of the problem where the subgraphs are not necessarily (vertex) induced. This edge-induced variant is denoted as *BMCES*. For both variants, it has been shown that they yield meaningful results for cheminformatics and are computable in polynomial time on outerplanar graphs [50,21 SPP,10 SPP].

In [50], a BMCES algorithm was proposed for outerplanar molecular graphs. Contrary to the original claim of *O*(*n*<sup>2.5</sup>) for a graph with *n* vertices, the algorithm allows no better bound than *O*(*n*<sup>4</sup>) on its running time [30 SPP]. A previously suggested algorithm regarding the BMCS problem for input graphs with tree width *k* ≤ 2 has a running time of *O*(*n*<sup>6</sup>) [32 SPP]. In the case of outerplanar input graphs, the running time can be reduced to *O*(*n*<sup>5</sup>). An essential part of this algorithm is the decomposition of the graphs into their *BC-* and *SPQR-trees*, which decompose the graphs into their biconnected and three-connected components. A maximum solution is then computed via a dynamic programming approach on the blocks and bridges.

Following the above result, we presented a faster approach tailored to outerplanar graphs [10 SPP]. On such graphs *G* and *H*, this algorithm achieves a running time of *O*(|*G*||*H*| Δ(*G, H*)), where Δ(*G, H*) = 1 if *G* or *H* is biconnected; otherwise, Δ(*G, H*) = min{Δ<sub>*C*</sub>(*G*), Δ<sub>*C*</sub>(*H*)}, where Δ<sub>*C*</sub>(*G*) and Δ<sub>*C*</sub>(*H*), respectively, is the maximum degree of all cutvertices in *G* and *H*, respectively. For outerplanar molecular graphs, the time bound is *O*(|*G*||*H*|) since they have bounded degree. The first major ingredient is a fast dynamic programming approach on the BC-trees of the input graphs, where we exploit the similarity between the maximum weight matching instances that we have to solve [9 SPP]. Here, we use an algorithm for the maximum weight matching problem with a running time depending on the smaller vertex set. The second ingredient is a quadratic time algorithm to find a biconnected maximum common subgraph between two blocks *b*<sub>1</sub> and *b*<sub>2</sub>. This is realized by enumerating all maximal (with respect to inclusion) biconnected common subgraphs between the two blocks. Each maximal solution *C* can be computed in time *O*(|*C*|). The total size of all maximal solutions per block pair (*b*<sub>1</sub>, *b*<sub>2</sub>) is *O*(|*b*<sub>1</sub>||*b*<sub>2</sub>|); hence the total algorithm's running time is *O*(|*G*||*H*|). Along the edges and vertices with different labels, the maximal solutions are split into smaller biconnected components. Among all those components, we keep one of maximum size.

For non-outerplanar graphs, we use a clique reduction to compute biconnected maximum common subgraphs between two blocks if at least one of them is not outerplanar. In the reduction, we enumerate c-cliques as presented in [8,27]. Among them, we keep a biconnected c-clique of maximum size. This approach reduces the practical running time compared to a pure clique-based algorithm operating on the whole graphs, since the computationally demanding clique problem must be solved for small components only. In contrast to the BMCES algorithm of [50], the above-described technique enables our algorithm to compute a solution for any two molecular graphs and lowers the practical running time for graphs with multiple blocks, even if they are not outerplanar.

We evaluated the practical running time of our algorithm [10 SPP] by comparing it to the BMCES algorithm from [50]. In our experiments, we used a dataset of 29,000 randomly chosen pairs of outerplanar molecular graphs from the NCI Open Database, GI50<sup>2</sup>, with an average of 22 vertices (atoms) and a maximum of 104 vertices. Our algorithm outperformed the competitor by a factor of 84 on average. The experimental results align with our theoretical correction [30 SPP] of the running time analysis given in [50]. It should be noted that the BMCES algorithm is already much faster than a general clique-based MCS algorithm [50]. Our BMCS algorithm outperforms such a general algorithm by several orders of magnitude. The practical difference of the results w.r.t. the vertex and edge induced variants is marginal, and we observed a disagreement in only 0.4% of the comparisons.

While our basic BMCS algorithm is fast in theory and practice, the primary goal is to find a meaningful common subgraph. It was observed that allowing disconnected common subgraphs improves the validity, given that the connected components are arranged consistently in both graphs [34,51]. However, solving the general disconnected variant is *NP*-hard even on trees. Moreover, small variations of the chemical elements (vertex labels) might be tolerated. We tackle these challenges in the next subsection.

#### **3.2 Maximum Similar Subgraph Based Similarities for Molecules**

This subsection presents several problem fields where the classical MCS definition is too strict w.r.t. molecular bioactivity. We show how these problem fields can be theoretically approached under the MSS definition and how they can be solved programmatically by integration in the MCS algorithms. Subsequently, we evaluate our MSS approach in comparison with several established molecular similarity measures.

From a chemical point of view, the two drugs shown in Fig. 1 are almost identical and are expected to have nearly identical properties w.r.t. bioactivity. However, an MCS-based comparison would interpret a large part of the molecules as different due to the nitrogen switch in the bicyclic ring system. In other words, the exchange of a nitrogen and a carbon atom in an aromatic ring should influence the molecular similarity only to a small extent under the maximum similar subgraph problem definition. In addition,

<sup>2</sup> http://cactus.nci.nih.gov.

**Fig. 4.** Molecular graphs of Melphalan (top) and Chlorambucil (bottom). The BMCS on the left (red) maps fewer vertices than the BMCS embedding on the right (blue, green). The atoms on the right side (O, O, H) may be added to the embedding by mapping the green paths to each other. (Color figure online)

atom types like aromatic nitrogen or carbon can be grouped by their properties, and such atom type groups can be used as representation instead. Thus, by softening the matching constraints in the MSS problem, a much larger substructure should be identified in the two molecules in comparison to the MCS approach. This problem can be solved with an atom type group representation [39] and a score in the range [0, 1] ∪ {−∞} assigned to group mappings (mappings of vertices with atom type group labels), where −∞ forbids the mapping. Hence, the objective is to maximize the weight of all mapped groups instead of the number of mapped vertices. The complete weight matrix is listed in Table III.2.3 of [19].

Additionally, we allow the mapping of disjoint paths of bridges (more precisely, the path's endpoints while skipping the inner vertices) to each other [11 SPP] in our MSS approach, i.e., we allow some kind of disconnection. We denote this technique *embedding*, following [17]. This is useful, e.g., if two molecules differ only in the length of a chain connecting similar or identical substructures. To prevent arbitrarily long paths, we introduce a linear penalty depending on the length of such paths. An example of two molecular graphs that profit from the described approach is depicted in Fig. 4.

In summary, we developed an algorithm applicable to molecular graphs that addresses the maximum similar subgraph problem by *(i)* using the established BMCS concept, *(ii)* allowing disconnectivity by mapping paths to edges, and *(iii)* supporting weight functions between labels. Moreover, our algorithm is efficient in theory and practice for the vast majority of molecular graphs.

In order to evaluate the quality of our MSS approach, we used a setup similar to [40] and compared it to state-of-the-art chemical fingerprint methods. Our main question was whether the MSS approach produces meaningful results when used to rank molecules. In the following, we present the key evaluation results for the single-assay benchmark, which consists of rather similar molecules that have been ranked by the authors in decreasing order of activity.

First, we analyzed different layers to represent the molecules. Among them are the chemical elements representation (e.g., N for nitrogen) and the file conversion (fconv) atom type groups [39]. We discovered that the latter representation, based on the weight matrix of Table III.2.3 of [19], performed best for the single-assay benchmark. As similarity coefficient, we used Bunke and Shearer's [43], which performed best among the tested ones. It is defined as *W* / max{*k, l*}, where *W* is the weight of the maximum common subgraph, and *k*, *l* are the sizes of the input graphs.
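For completeness, the coefficient is trivial to compute once *W* is known; the numbers in the sketch are made up.

```python
def bunke_shearer(W, k, l):
    """Bunke-Shearer similarity: weight of the common subgraph over the larger graph."""
    return W / max(k, l)

print(bunke_shearer(W=18.5, k=24, l=27))  # -> 0.685...
```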

Compared to other methods, the very popular ECFP4 fingerprint showed the best match with the reference ranking followed by our MSS embedding approach. This is followed by RDKit7 (fingerprint of all subpaths up to path length 7), MSS without embedding, RDKit6, and the BMCS approach. Other fingerprint methods ranked in between. Extended-Connectivity Fingerprints (ECFPs) capture the neighborhood of the non-hydrogen atoms in circular layers up to a given diameter (e.g., 4 in the case of ECFP4). Thus, their features, similarly to the MSS, also represent the presence of particular small substructures. However, the advantage of our MSS approach is that it explicitly computes the similar substructures of the molecules and a concrete mapping between the atoms (vertices). It also achieved a high distinctiveness between the results, which is important to virtually screen large (big data) molecular libraries. The additional feature of mapping disjoint paths to each other showed improved results on the ranking benchmarks. More detailed results, as well as additional tests, can be found in [19].

# **4 Clustering Analysis**

As mentioned in the introduction, clustering is used for a variety of use cases in drug discovery. In the following, we will focus on the task to cluster large scale molecular datasets of labeled graphs. An application of the presented approach is given in Sect. 5.3.

**Definition 2.** *A* clustering *of a graph dataset 𝒳, i.e., a multiset of labeled graphs, is a partition 𝒞 = {C<sub>1</sub>, ..., C<sub>n</sub>} of 𝒳 that maximizes cluster homogeneity and often separation.*

The concrete definitions of homogeneity and separation differ between clustering methods. Common measures for homogeneity are diameters or radii, density, or relative closeness to cluster representatives. Separation is often defined over the minimum distance between cluster elements or some aggregated cluster features. In contrast to homogeneity, separation is not always considered by clustering algorithms. For example, it is challenging to find a suitable definition of separation for projected clustering algorithms, since each cluster is linked to its own subspace and is thereby incomparable to other clusters. Meta algorithms can be used to tune clustering algorithms that do not optimize separation directly, so that they nevertheless achieve well-separated clusterings. For example, the number of clusters can be used as such a tuning parameter for the classical k-means algorithm [56].

#### **4.1 Challenges and Approaches in Molecular Cluster Analysis**

A major design decision for clustering algorithms is the data representation. Most classical clustering algorithms rely on vectorial data interpreted as points in some predefined space (e.g., a real vector space with the *l*<sub>2</sub>-norm) or, more generally, on pairwise distances or kernels. Exchangeable distances or kernels are very versatile since they allow the clustering algorithm to be adapted to the specific clustering task. However, the explicit vector space representation with a fixed norm is often beneficial in terms of computational complexity. For example, it allows the explicit calculation of centroids, easy extraction of subspaces, or the use of binning. With these methods, it is often possible to avoid calculating a quadratic number of pairwise distances during the clustering process.

To fit a graph dataset into these models, the graphs must either be transformed into vectors (e.g., by using structural fingerprints or Weisfeiler-Lehman features [38]) or kernels/distances must operate directly on graph data (e.g., graph kernels [31] or the distance given in Sect. 3). However, while preferable in terms of generality, these generalized methods have weaknesses in the discussed domain.

First, both methods tend to produce (intrinsically) high-dimensional datasets [29]. While a high dimensionality may even be beneficial in supervised learning, intrinsically high-dimensional datasets are linked to the so-called concentration effect [5] in the unsupervised setting. This effect causes the pairwise distances to lose their relative contrast, i.e., the distances converge towards a common value. The concentration effect is closely related to poor clusterability [1,57]. Furthermore, it causes metric index structures to be inefficient. Subspace or projected clustering methods, which are usually used in such a setting, come with an extra computational burden and are usually limited to vector spaces.

Second, the transformation to reasonably sized vectors is lossy and non-reversible. This causes the clustering results to be hard to interpret since cluster features, centroids, or subspaces are not in the application domain. Thus, these methods fail to provide a domain specific explanation about cluster commonalities.

As a consequence of these issues, structural clustering methods have been developed, which provide cluster descriptions or interpretations directly in the graph domain. This is accomplished by various constructs, including subgraph isomorphisms, (maximum) common subgraphs [6], frequent subgraphs [57], graph edit operations [23], and set medians [14,23]. For example, a cluster description can be given in the form of common subgraphs. Since most of these sub-problems are themselves challenging *NP*-hard problems, structural clustering algorithms are often limited to small datasets (e.g., [14,23,58]) or very special graph classes (e.g., trees [3]). As a consequence of the computational complexity, some of these clustering algorithms are hybrid approaches, which utilize approximations in vector space in order to map the results back into the graph domain. For example, the clustering algorithms in [14,23] calculate a cluster median in vector space but assign graphs to clusters w.r.t. the graph edit distance. A hierarchical k-means clustering in vector space is used as a starting point in [6]. It is later refined in order to increase the size of the common substructures. To the best of our knowledge and besides our own work, the only structural clustering algorithm for larger-scale datasets of general labeled graphs is presented in [54]. In this algorithm, each partition element of a vectorial pre-clustering is further partitioned with a structural

**Fig. 5.** Real world clusters with representatives (grey boxes) generated by *StruClus*. Colors represent node labels. (Color figure online)

algorithm. The pre-clustering is designed to only separate graphs that would also be separated, with high probability, by the structural variant if the structural clustering were applied to the whole dataset.

#### **4.2** *StruClus***: Scalable Structural Graph Set Clustering**

*StruClus* [49 SPP] is a structural projected clustering algorithm that is tailored towards our setting of large-scale datasets (on the order of 10<sup>6</sup> graphs) of small labeled graphs (drug-like molecules are limited in their maximum size for biological reasons). Its linear runtime w.r.t. the dataset size, the usage of various sampling strategies, and a parallelizable algorithm design make *StruClus* scalable and very fast in practice. It incorporates homogeneity and separation constraints for high-quality results.

A central concept of *StruClus* is the usage of cluster representative sets *R*(*C*) for each cluster *C* ∈ *C* (cf. Fig. 5 for a real-world example), which contain frequent subgraphs of the cluster members. They are beneficial in terms of computational complexity, since they enable graph-cluster comparisons without looking at the cluster members (similar to the concepts of centroids or medoids). Additionally, they lead to human-interpretable clusters by explaining the cluster content in the application domain.

The main objective of *StruClus* is to maximize homogeneity in the sense that a large fraction of the nodes and edges of the cluster members is covered by some subgraph isomorphism from the representatives. Similar to the classical k-means algorithm, this is achieved by an iterative optimization procedure that updates the representatives and re-assigns the cluster members to the best fitting cluster. However, the number of clusters is not pre-defined but adapted to the dataset structure with the help of cluster splitting operations on inhomogeneous clusters. Additionally, clusters with similar representatives are merged in order to maintain a well-separated clustering.

Performance-wise, the major challenge lies in the discovery of suitable representatives *R*(*C*) for each cluster *C*. Since the number of frequent subgraphs may be exponential w.r.t. the maximal graph size in the cluster, *StruClus* utilizes a randomized maximal frequent subgraph sampling method. This is implemented by a random exploration of the frequent subgraphs of each *C*, which form a meet-semilattice with the partial order derived from the sub- and supergraph relation (cf. Fig. 6). Each random exploration starts with the empty graph and moves up in the lattice until a maximal frequent subgraph is reached. Since the support is monotonically non-increasing along the supergraph relation, the search space can be pruned with the minimum support threshold *supp*min.

**Fig. 6.** Example of a meet-semilattice of subgraphs ordered by the subgraph isomorphism relation. Node colors indicate labels. Maximal frequent subgraphs are marked with a blue background color. (Color figure online)

The maximal frequent subgraph sampling described above is complemented with an error-bounded stochastic sampling strategy over the cluster members to determine whether a graph pattern is frequent. A subset of the maximal frequent subgraphs produced by this twofold sampling procedure is then selected as representatives by ranking them w.r.t. the homogeneity criterion above.
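The following sketch illustrates the two sampling ideas on a simplified stand-in: since subgraph enumeration and subgraph isomorphism are beyond a few lines, the lattice is formed by itemsets (label sets) instead of subgraphs, for which extension and containment checks are trivial while support remains anti-monotone. The function names, the Hoeffding-style sample size, and the error parameters are illustrative assumptions and not taken from *StruClus*.

```python
import math
import random

def sampled_support(pattern, dataset, eps, delta, rng):
    """Estimate the support of `pattern` on a random sample of the dataset.
    The Hoeffding-style sample size keeps the estimate within +-eps of the
    true support with probability at least 1 - delta (illustrative choice)."""
    size = min(len(dataset), math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps)))
    sample = rng.sample(dataset, size)
    return sum(pattern <= member for member in sample) / size

def sample_maximal_frequent(dataset, items, supp_min, eps=0.05, delta=0.01, seed=0):
    """Random upward walk in the pattern lattice: start from the empty pattern
    and keep adding random items as long as the (estimated) support stays
    above supp_min; the walk ends in an (approximately) maximal frequent pattern."""
    rng = random.Random(seed)
    pattern = frozenset()
    candidates = list(items)
    rng.shuffle(candidates)
    for item in candidates:
        extended = pattern | {item}
        # anti-monotonicity: an infrequent extension cannot become frequent
        # again further up in the lattice, so this branch is pruned
        if sampled_support(extended, dataset, eps, delta, rng) >= supp_min:
            pattern = extended
    return pattern

if __name__ == "__main__":
    data = [frozenset(s) for s in ("abc", "abd", "abe", "acd", "bcd", "ab")]
    print(sample_maximal_frequent(data, set().union(*data), supp_min=0.5))
```

Repeating the walk with different seeds yields a diverse pool of maximal frequent patterns, from which representatives can then be ranked and selected.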

In comparison with structural clustering competitors, such as [14,23,53,54,58], *StruClus* is able to raise the maximum feasible dataset size by multiple orders of magnitude, reaching into the domain of large-scale *de novo* databases. At the same time, *StruClus* outperforms the structural competitors that can still handle medium- to large-scale datasets in terms of quality. Figure 7 shows an extract of an in-depth evaluation given in [49 SPP] w.r.t. quality and performance on a real-world dataset (heterocycle) and a synthetic dataset. The heterocycle dataset consists of composed molecules classified by their reaction types. The synthetic dataset is constructed such that the graphs of each class share common subgraphs and is used to perform analyses with varying parameters. In Sect. 5.3, we present a real-world use case of *StruClus*.

# **5 Rational Drug Design Applications**

In this section, we present successful applications of the above approaches. Additionally, we present our tool Scaffold Hunter, which brings the scientific findings into the realm of practical drug design.

#### **5.1 Scaffold Hunter**

Scaffold Hunter [26,48 SPP] is open-source software for the analysis and visualization of molecular data with the aim of supporting the user in elucidating structure-activity relationships. To this end, it features several structural classification schemes with dedicated visualizations and techniques to indicate chemical properties such as biological activity, e.g., by mapping values to colors, cf. Fig. 8. A fundamental structure-based concept relies on common core structures, so-called *scaffolds*, which can be organized hierarchically in a scaffold tree [52]. This approach forms the basis for several views, which show the scaffold tree in a radial layout, in the form of a tree map, or a set of scaffolds as a molecule cloud [13]. The latter view is inspired by the popular *word cloud* method, where the importance of words is indicated by their size. Here, scaffolds are scaled according to the number of molecules in the dataset containing them.

**Fig. 7.** *StruClus* evaluation in comparison with SCAP [54], Proclus [2], and Kernel K-Means [16]. Graphlet (i.e., small induced subgraph) frequencies are used for Proclus and Kernel K-Means.

Following a different concept, structure-based hierarchical clustering is supported by means of chemical fingerprint similarity. Specifically for very large data sets, we have developed a heuristic method based on metric indexing [29]. The result can be visualized as a dendrogram that can be linked to a table or a heatmap. The heatmap visualizes property values in a matrix using color coding, where the columns are ordered in accordance with the dendrogram. This makes it possible to identify whether chemical properties align with the structural similarity.

Several publications have shown that Scaffold Hunter is useful in various research tasks such as scaffold hopping, target prediction, chemical space analysis, and natural product simplification [7,26,33,45,21 SPP].

**Fig. 8.** Scaffold Hunter visualizes molecular data in various linked views: (a) scaffold tree and dendrogram view; (b) heatmap, treemap, and cloud view.

**Fig. 9.** Co-crystal structure of BRD4 in complex with one of the identified novel inhibitors (6g0e@pdb).

#### **5.2 BRD4**

In this study, an unexpected similarity in ligand binding between the bromodomain-containing protein 4 (BRD4) and the peroxisome proliferator-activated receptor gamma (PPARγ) was identified. Both are important drug targets in cancer therapy, cardiovascular diseases, and inflammation processes [15,24]. The starting point was a scaffold-focused analysis of bioactivity data using the command-line version of Scaffold Hunter [48 SPP]. This analysis revealed a bicyclic scaffold that can be found, amongst others, in known ligands for BRD4 and PPARγ. Compounds with similarity to known PPARγ ligands were subsequently selected and tested on BRD4. Interestingly, the hit rate, i.e., the proportion of tested compounds active on BRD4, was unexpectedly high. Some of the novel inhibitors were successfully co-crystallized; one example is shown in Fig. 9. Further analyses of both proteins support this unexpected relationship between the two drug targets [21 SPP], as they also show a high similarity of their binding sites. Based on this result, it seems possible to develop a drug that modulates both proteins with synergistic effects. Such a dual modulator could have implications for the prevention or treatment of resistances against BRD4 inhibitors, which have already been observed [42]. Thus, this study demonstrates the successful application of a graph-based method in a prospective drug discovery study.

**Fig. 10.** Per-cluster database distribution for the novelty analysis of CH*I*PMUNK. The green share is the MCR-CH*I*PMUNK sublibrary, blue is ChEMBL, and red are commercially available compounds. The plot shows that some clusters are (almost) exclusively covered by CH*I*PMUNK. [Taken from [22 SPP], reprinted with permission from Wiley.] (Color figure online)

#### **5.3 Chipmunk**

CH*I*PMUNK (CHemically feasible *In silico* Public Molecular UNiverse Knowledge base) [22 SPP] is a novel virtual library of small molecules that are synthesizable from purchasable reactants. The goal of such *de novo* libraries is to expand the known chemical and bioactivity space in order to enable virtual analytical processes to extract meaningful novel molecular structures, e.g., for drug discovery. The *in silico* simulated reactions are chosen such that they can be carried out in reality with high probability. Altogether, CH*I*PMUNK covers over 95 million compounds.

The evaluation of CH*I*PMUNK showed that the content of the library has interesting chemical properties and that the library covers previously undiscovered regions of the chemical and bioactivity space. The former aspect was analyzed using descriptor-based methods, which revealed that CH*I*PMUNK nicely covers the physicochemical space of protein modulators and protein-protein interaction modulators. *StruClus* (cf. Sect. 4.2) was used for the evaluation of the latter aspect, the novelty analysis. Additionally, *StruClus* itself was evaluated to verify that it creates useful clusterings w.r.t. chemical properties (refer to [22 SPP] for further details). Thus, molecules of the same cluster exhibit similar chemical and biological properties with high probability.

To analyze the novelty of CH*I*PMUNK, several libraries of commercially available compounds (ZINC [55], MolPort<sup>3</sup>, and eMolecules<sup>4</sup>) as well as the large-scale ChEMBL [4] bioactivity database were clustered in conjunction with CH*I*PMUNK. The former libraries serve as the known chemical space, whereas the latter serves as the known bioactivity space. The clustering revealed a large portion of clusters consisting purely of CH*I*PMUNK compounds (cf. Fig. 10 for an example).

Thus, it was shown that CH*I*PMUNK encompasses regions that are not covered by existing databases but nevertheless exhibit physicochemical properties typical of protein modulators or protein-protein interaction modulators. It can be concluded that CH*I*PMUNK has the potential to contain future drugs.

<sup>3</sup> https://www.molport.com/.

<sup>4</sup> https://www.emolecules.com/.

**Fig. 11.** The Cover Feature shows three chipmunks involved in the creation, analysis, and clustering of the synthesizable virtual molecule library CH*I*PMUNK. Nearly 100 million compounds were generated with *in silico* reactions on accessible building blocks, and their descriptor profile was analyzed. [Taken from [22 SPP], reprinted with permission from Wiley.]

The CH*I*PMUNK library is publicly available together with the clustering results. Areas of the chemical space (i.e., clusters) that overlap with the ChEMBL library can be used to relate novel molecules from CH*I*PMUNK to already known molecules from ChEMBL (in terms of structural similarity). This helps transfer existing knowledge to the CH*I*PMUNK library and may thus aid in identifying the biological targets of the CH*I*PMUNK compounds.

# **6 Conclusion and Outlook**

Graph-based methods for the analysis of molecular data sets are particularly appealing because they can reveal subtle structural differences and allow interpretation in terms of substructures. The complexity of the related graph-theoretical problems, however, makes their application to large data sets challenging. We have developed new methods based on common substructures, which take the specific constraints in cheminformatics into account and exploit the properties of molecular graphs. Thereby, our techniques become efficient in both theory and practice. The application to molecular similarity search shows that our approach produces chemically meaningful rankings of molecules. Thus, it is well suited for virtual screening in large molecular databases. Moreover, we have developed a structural clustering algorithm, which represents clusters by common substructures and scales to very large databases with millions of molecules. Our methods have proven useful in various research tasks in rational drug design. The success of our approaches was also recognized in 2018, when we were invited to contribute the cover feature of the June issue of ChemMedChem (cf. Fig. 11).

At the time of writing, our project is still ongoing. We are currently developing a distributed algorithm to mine representative sets of subgraphs for a variety of use cases, including but not limited to a fully distributed structural clustering algorithm. For this, the discussions and results within the SPP have been very useful (cf. Chap. 14).

Within our project, we have also developed other approaches to algorithmic data analysis. For example, we have studied Graph Neural Networks (GNNs) and their use for generating molecular representations in virtual screening approaches. Here, GNNs performed worse than fingerprint-based multilayer perceptrons, which calls into question the use of simple GNNs to obtain molecular representations [36 SPP,37 SPP]. Future work will show whether more complex graph-based representations will be able to replace molecular fingerprints as suitable input. For these learning approaches, it will be helpful to also learn with large generated graph families (cf. Chap. 2 and Chap. 3). Together with Christian Schulz, we investigate the applicability of kernelization (cf. Chap. 5), i.e., the iterative reduction of the problem to smaller instances, to common subgraph problems in large graphs. Matching problems build a connection to Chap. 13, which is also concerned with life science applications. Jointly, we have worked on new streaming algorithms approximating the bipartite matching problem.

**Acknowledgement.** This work was supported by the German Research Foundation (DFG), priority programme *Algorithms for Big Data (SPP 1736)*. At the start of this project, all authors were members of TU Dortmund University. Permission to reprint Fig. 10 and Fig. 11 was granted by Wiley.

# **References**

	- 12. Ehrlich, H.C., Rarey, M.: Maximum common subgraph isomorphism algorithms and their applications in molecular science: a review. Wiley Interdisc. Rev. Comput. Molec. Sci. **1**(1), 68–79 (2011). https://doi.org/10.1002/wcms.5
	- 13. Ertl, P., Rohde, B.: The molecule cloud compact visualization of large collections of molecules. J. Cheminf. **4**(1), 12 (2012). https://www.jcheminf.com/content/4/1/12
	- 14. Ferrer, M., Valveny, E., Serratosa, F., Bardají, I., Bunke, H.: Graph-based *k*-means clustering: a comparison of the set median versus the generalized median graph. In: Jiang, X., Petkov, N. (eds.) CAIP 2009. LNCS, vol. 5702, pp. 342–350. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-03767-2_42
	- 15. Ferri, E., Petosa, C., McKenna, C.E.: Bromodomains: structure, function and pharmacology of inhibition. Biochem. Pharmacol. **106**, 1–18 (2016). https://doi.org/10.1016/j. bcp.2015.12.005
	- 16. Girolami, M.A.: Mercer kernel-based clustering in feature space. IEEE Trans. Neural Netw. **13**(3), 780–784 (2002). https://doi.org/10.1109/TNN.2002.1000150
	- 17. Gupta, A., Nishimura, N.: Finding largest subtrees and smallest supertrees. Algorithmica **21**, 183–210 (1998). https://doi.org/10.1007/PL00009212
	- 18. Horváth, T., Ramon, J., Wrobel, S.: Frequent subgraph mining in outerplanar graphs. Data Min. Knowl. Disc. **21**(3), 472–508 (2010). https://doi.org/10.1007/s10618-009-0162-1
	- 19. Humbeck, L.: Betrachtung der Ähnlichkeit von niedermolekularen Verbindungen unter Berücksichtigung der biologischen Aktivität. Dissertation, TU Dortmund University (2019)
	- 23. Jouili, S., Tabbone, S., Lacroix, V.: Median graph shift: a new clustering algorithm for graph domain. In: 20th International Conference on Pattern Recognition, pp. 950–953 (2010). https://doi.org/10.1109/ICPR.2010.238
	- 29. Kriege, N., Mutzel, P., Schäfer, T.: Practical SAHN clustering for very large data sets and expensive distance metrics. J. Graph Algor. Appl. **18**(4), 577–602 (2014). https://doi.org/10.7155/jgaa.00338
	- 31. Kriege, N.M., Johansson, F.D., Morris, C.: A survey on graph kernels. Appl. Netw. Sci. **5** (2020). https://doi.org/10.1007/s41109-019-0195-3
	- 33. Lachance, H., Wetzel, S., Kumar, K., Waldmann, H.: Charting, navigating, and populating natural product chemical space for drug discovery. J. Med. Chem. **55**(13), 5989– 6001 (2012). https://doi.org/10.1021/jm300288g, pMID: 22537178
	- 34. Marialke, J., Körner, R., Tietze, S., Apostolakis, J.: Graph-based molecular alignment (GMA). J. Chem. Inf. Model. **47**(2), 591–601 (2007). https://doi.org/10.1021/ci600387r
	- 35. Matula, D.W.: Subtree isomorphism in *O*(*n*^(5/2)). In: Alspach, B., Miller, D.P.H. (eds.) Algorithmic Aspects of Combinatorics, Annals of Discrete Mathematics, vol. 2, pp. 91–106. Elsevier (1978). https://doi.org/10.1016/S0167-5060(08)70324-8
	- 38. Morris, C., Rattan, G., Mutzel, P.: Weisfeiler and Leman go sparse: towards scalable higher-order graph embeddings. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) NeurIPS (2020). https://proceedings.neurips.cc/paper/2020/hash/ f81dee42585b3814de199b2e88757f5c-Abstract.html
	- 39. Neudert, G., Klebe, G.: fconv: format conversion, manipulation and feature computation of molecular data. Bioinform. **27**(7), 1021–1022 (2011). https://doi.org/10.1093/ bioinformatics/btr055
	- 40. O'Boyle, N., Sayle, R.: Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminf. **8** (2016). https://doi.org/10.1186/s13321-016-0148-0
	- 41. Rarey, M., Dixon, J.S.: Feature trees: a new molecular similarity measure based on tree matching. J. Comput.-Aided Molec. Des. **12**, 471–490 (1998). https://doi.org/10.1023/ A:1008068904628
	- 50. Schietgat, L., Ramon, J., Bruynooghe, M.: A polynomial-time maximum common subgraph algorithm for outerplanar graphs and its application to chemoinformatics. Ann. Math. Artif. Intell. **69**(4), 343–376 (2013). https://doi.org/10.1007/s10472-013-9335-0
	- 51. Schmidt, R., Krull, F., Heinzke, A.L., Rarey, M.: Disconnected maximum common substructures under constraints. J. Chem. Inf. Model. **61**(1), 167–178 (2021). https://doi. org/10.1021/acs.jcim.0c00741, pMID: 33325698
	- 52. Schuffenhauer, A., Ertl, P., Roggo, S., Wetzel, S., Koch, M.A., Waldmann, H.: The scaffold tree - visualization of the scaffold universe by hierarchical scaffold classification. J. Chem. Inf. Model. **47**(1), 47–58 (2007). https://doi.org/10.1021/ci600338x
	- 53. Seeland, M., Berger, S.A., Stamatakis, A., Kramer, S.: Parallel structural graph clustering. In: ECML/KDD, Athens, Greece, pp. 256–272 (2011). https://doi.org/10.1007/978-3-642-23808-6_17
	- 54. Seeland, M., Karwath, A., Kramer, S.: Structural clustering of millions of molecular graphs. In: Symposium on Applied Computing, SAC 2014, pp. 121–128. ACM, Gyeongju (2014). https://doi.org/10.1145/2554850.2555063
	- 55. Sterling, T., Irwin, J.J.: ZINC 15 - ligand discovery for everyone. J. Chem. Inf. Model. **55**(11), 2324–2337 (2015). https://doi.org/10.1021/acs.jcim.5b00559
	- 56. Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. Roy. Stat. Soc.: Series B (Stat. Methodol.) **63**(2), 411–423 (2001). https://doi.org/10.1111/1467-9868.00293
	- 57. Tsuda, K., Kudo, T.: Clustering graphs by weighted substructure mining. In: ICML, pp. 953–960. ACM (2006)
	- 58. Tsuda, K., Kurihara, K.: Graph mining with variational dirichlet process mixture models. In: SDM, pp. 432–442. SIAM (2008). https://doi.org/10.1137/1.9781611972788.39

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Recent Advances in Practical Data Reduction**

Faisal N. Abu-Khzam1, Sebastian Lamm2, Matthias Mnich3, Alexander Noe4, Christian Schulz5(B) , and Darren Strash6

<sup>1</sup> Lebanese American University, Beirut, Lebanon faisal.abukhzam@lau.edu.lb <sup>2</sup> Karlsruhe Institute of Technology, Karlsruhe, Germany sebastian.lamm@kit.edu <sup>3</sup> Hamburg University of Technology, Institute for Algorithms and Complexity, Hamburg, Germany matthias.mnich@tuhh.de <sup>4</sup> University of Vienna, Vienna, Austria alexander.noe@univie.ac.at <sup>5</sup> Heidelberg University, Heidelberg, Germany christian.schulz@informatik.uni-heidelberg.de <sup>6</sup> Hamilton College, New York, USA dstrash@hamilton.edu

**Abstract.** Over the last two decades, significant advances have been made in the design and analysis of fixed-parameter algorithms for a wide variety of graphtheoretic problems. This has resulted in an algorithmic toolbox that is by now well-established. However, these theoretical algorithmic ideas have received very little attention from the practical perspective. We survey recent trends in data reduction engineering results for selected problems. Moreover, we describe concrete techniques that may be useful for future implementations in the area and give open problems and research questions.

**Keywords:** Data reduction · Kernelization · Fixed-parameter algorithms · Algorithm engineering

# **1 Introduction**

Many important real-world optimization problems are NP-hard: it is believed that no polynomial time algorithm exists that always finds an optimal solution. However, many NP-hard problems have been shown to be fixed-parameter tractable (FPT): large inputs can be solved efficiently and provably optimally, as long as some problem parameter is small. Over the last two decades, significant advances have been made in the design and analysis of fixed-parameter algorithms for a wide variety of graph-theoretic problems. This has resulted in an algorithmic toolbox that is by now well-established. However, these theoretical algorithmic ideas have received very little attention from the practical perspective. Until recently, few fixed-parameter algorithms have been implemented and tested on real data sets, and their practical potential is far from understood. Traditionally, algorithms are designed using simple models of problems and machines. In turn, important results are provable, such as performance guarantees for all possible inputs. This often yields elegant solutions that are adaptable to many applications, with predictable performance for previously unknown inputs.

In contrast to algorithm theory, taking up and implementing an algorithm is part of application development. Unfortunately, transferring results from theory to practice is a slow process and sometimes the theoretically-best algorithms perform poorly in experiments. Hence, practitioners often do not read research papers from the theoretical algorithms community. This causes a growing gap between theory and practice: Realistic hardware with its parallelism, memory hierarchies, etc. is diverging from traditional machine models. This gap is also partially due to the fact that the research community working on algorithmic problems is fairly separated. On the one hand, there are "hard core" algorithms researchers that are focused mainly on theoretical work and rarely participate in conferences in application areas. On the other hand, researchers of application areas publish their work in conferences and journals of their respective fields, and often do not visit theory conferences. In contrast to algorithm theory, algorithm engineering uses an innovation cycle where algorithm design based on realistic models, theoretical analysis, efficient implementation, and careful experimental evaluation using real-world inputs closes gaps between theory and practice and leads to improved application code and reusable software libraries (see www.algorithm-engineering.de). This yields results that practitioners can rely on for their specific application.

On the one hand, experimental results can trigger new theoretical questions and suggest new properties of inputs that are relevant parameters to use in theoretical analysis. On the other hand, the toolbox of parameterized algorithm theory offers a rich set of algorithmic ideas that are challenging to implement and engineer in practical settings. By applying techniques from fixed-parameter algorithms in nontrivial ways, algorithms can be obtained that perform surprisingly well on real-world instances of NP-hard problems. The viability of this approach has been demonstrated in recent years through the Parameterized Algorithms and Computational Experiments Challenge (PACE) [28,54,55,58], in which teams compete to solve real-world inputs using ideas from parameterized algorithm design. Many researchers from all over the world have participated in that challenge. Moreover, this viability has recently also been demonstrated by a wide range of papers. Since the engineering part of the area has recently gained some momentum, we survey recent results and techniques that have started to bridge the gap between theory and practice currently observed in this area.

*Theoretical Context.* All known exact and deterministic algorithms that solve NP-hard problems require time that is at least super-polynomial in the total size of the input. However, some problems can be solved by algorithms that run in time which is exponential only in the size of a fixed parameter while polynomial in the size of the input; those are called *fixed-parameter algorithms*. Here, the parameterized problem can be solved efficiently for small values of the fixed parameter. Formally, a parameterized problem is a language *L* ⊆ Σ* × ℕ, where Σ is a finite alphabet. The second component is called the parameter of the problem. A parameterized problem *L* is *fixed-parameter tractable* if the question (*x*, *k*) ∈ *L* can be decided by an algorithm in running time *f*(*k*) · |*x*|^(*O*(1)), where *f* is a computable function depending on *k* only. The corresponding complexity class is called FPT.

The *W hierarchy* [56] is an important hierarchy for the complexity of parameterized problems. A parameterized problem is in class *W*[*i*], if we can transform every instance (*x,k*) to a decision circuit (a combinatorial circuit with only a single output gate) with *weft* at most *i*, such that the circuit outputs true if and only if (*x,k*) ∈ *L*. The weft of a combinatorial circuit is the maximum number of gates with more than two inputs on any path from input to output. Downey et al. [56] show that FPT = *W*[0] and that *W*[0] ⊆ *W*[1] ⊆ *W*[2] ⊆···⊆ *W*[*poly*].

Fixed-parameter tractability is closely related to data reduction and kernelization. *Data reduction rules*, or simply *reductions*, reduce the size of a graph while retaining the ability to compute an optimal solution. A graph on which a collection of data reduction rules has been exhaustively applied is called a *reduced* graph. In kernelization, the reduced graph is called a *kernel*. More formally, given a binary-encoded instance (*x*, *k*) ∈ {0,1}* × ℕ of some parameterized problem *L*, a *kernelization* for *L* produces in polynomial time an instance (*x*′, *k*′) that satisfies: (*x*′, *k*′) ∈ *L* ⇔ (*x*, *k*) ∈ *L* and |*x*′| + *k*′ ≤ *f*(*k*), where *f* is a computable function. Note that *f* only depends on the problem parameter *k*. So roughly speaking, kernelization can be thought of as a preprocessing routine that reduces a given problem instance to its "most difficult part". The function *f* measures the kernel size. If *f*(*k*) = *O*(*k*^*c*) for some constant *c*, then the kernel is called a polynomial kernel, and we say the problem admits a polynomial kernel.
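As a concrete illustration of this definition (not part of the survey's own contributions), the classic high-degree kernelization for Vertex Cover attributed to Buss can be sketched in a few lines: a vertex of degree larger than the remaining budget must be in every small cover, isolated vertices can be discarded, and afterwards more than *k*′² remaining edges certify a no-instance. The function and variable names below are illustrative.

```python
def buss_kernel(adj, k):
    """High-degree kernelization sketch for Vertex Cover (Buss' rule).

    adj: dict mapping each vertex to a set of neighbours (undirected graph).
    Returns (reduced_adj, k_prime, forced), where `forced` are vertices that
    belong to every vertex cover of size <= k, or None for a no-instance."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    forced = set()
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if not adj[v]:                         # isolated vertices never help
                del adj[v]
                changed = True
            elif len(adj[v]) > k - len(forced):    # must be in any cover of size <= k
                forced.add(v)
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
    k_prime = k - len(forced)
    edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    if k_prime < 0 or edges > k_prime * k_prime:   # kernel-size bound violated
        return None
    return adj, k_prime, forced
```

Every remaining vertex has degree at most *k*′, so a cover of size *k*′ can cover at most *k*′² edges, which is exactly the kernel-size argument.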

Many exact algorithms for parameterized problems combine these data reductions with *branching*. These algorithms are called *branch-and-reduce algorithms*. First, the algorithm aims to reduce the graph size by exhaustively applying reduction rules until there are no further data reductions possible or they are prohibitively expensive. Then, the algorithm picks an edge *e* ∈ *E* (or a vertex *v* ∈ *V*, depending on the problem) and *branches* the problem into multiple subproblems, one subproblem for each potential state of *e* in regard to the problem. As an example, for the maximum cut problem or the multiterminal cut problem, branching creates two subproblems, one in which *e* is part of the cut and one in which *e* is not part of the cut. The branch-and-reduce algorithm then continues to apply reduction rules to both of these subproblems and continues branching when there are no further reductions possible. The branch-and-reduce algorithm returns the best result over all branches.
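To make the interplay of reduction and branching concrete, the following is a minimal branch-and-reduce sketch for Minimum Vertex Cover, using only the isolated-vertex and degree-1 rules and branching on a maximum-degree vertex. It illustrates the paradigm only and is not one of the engineered solvers discussed below; all names are hypothetical.

```python
def min_vertex_cover(adj):
    """Minimal branch-and-reduce sketch for Minimum Vertex Cover
    (exponential time, for illustration). adj: dict vertex -> set of neighbours."""

    def without(adj, removed):
        # delete `removed` vertices and all edges incident to them
        return {v: nbrs - removed for v, nbrs in adj.items() if v not in removed}

    def solve(adj):
        cover = set()
        while True:  # --- reduce exhaustively ---
            isolated = {v for v, nbrs in adj.items() if not nbrs}
            if isolated:
                adj = without(adj, isolated)
                continue
            deg1 = next((v for v, nbrs in adj.items() if len(nbrs) == 1), None)
            if deg1 is not None:
                u = next(iter(adj[deg1]))      # the neighbour can always be taken
                cover.add(u)
                adj = without(adj, {deg1, u})
                continue
            break
        if not adj:                            # fully reduced: no edges left
            return cover
        # --- branch on a maximum-degree vertex v: either v or all of N(v) is in the cover
        v = max(adj, key=lambda x: len(adj[x]))
        with_v = {v} | solve(without(adj, {v}))
        with_nv = set(adj[v]) | solve(without(adj, set(adj[v])))
        return cover | min(with_v, with_nv, key=len)

    return solve({v: set(nbrs) for v, nbrs in adj.items()})
```

Real solvers add many more rules (degree-2 folding, LP-based, unconfined, etc.), better bounds, and pruning, but they follow this same skeleton.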

*Organization.* The rest of the paper is organized as follows. We first survey recent data reduction engineering results for selected NP-hard problems, and then for problems in P. We then describe concrete techniques that may be useful for future implementations in the area. Lastly, we give open problems and research questions.

# **2 Recent Advances for NP-Hard Problems**

#### **2.1 Maximum Independent Set and Minimum Vertex Cover**

Given an undirected graph *G* = (*V*, *E*), the goal of the *maximum independent set* (MIS) problem is to compute a set of vertices *I* ⊆ *V* such that (1) no two vertices in *I* are adjacent to one another, and (2) the set *I* has maximum cardinality among all such sets. The complement of an independent set *I*, *V* \ *I*, is called a *vertex cover*. The MIS problem and the complementary problem of finding a minimum vertex cover (MVC) are well-studied NP-hard optimization problems [75] that attract both researchers and practitioners alike. Furthermore, there is no polynomial time algorithm that approximates the MIS size within a factor of *O*(*n*^(1−ε)) for any constant ε > 0, unless P = NP [173]. Finally, MIS is *W*[1]-hard [56] when parameterized by solution size *k*. This makes it unlikely that the problem is fixed-parameter tractable in *k* [56]. On the other hand, MVC is fixed-parameter tractable in solution size *k* [56].

**Exact Approaches.** In recent years, the gap between theoretically efficient algorithms and their practical applicability has been significantly reduced. In particular, algorithms following the branch-and-reduce paradigm, i.e., branching algorithms that use a wide variety of reduction rules, have been shown (1) to achieve theoretical running times that are among the best for both MIS and MVC [69,170], and (2) to be able to solve large real-world networks in practice [5]. However, most often the approaches used in practice only use a small subset of the reduction rules that have been proposed to achieve good theoretical running times.

Abu-Khzam et al. [4] introduced and analyzed the crown reduction rule (and the usage of data reduction rules in this context in practice). Even though the crown rule is not as powerful as the linear programming (LP)-based rule [133] when considering the worst-case size of the resulting kernel, they experimentally verified that it often performs as well as the LP-based rule and is significantly faster in many cases. Furthermore, they show that the LP-based rule is most useful for fairly sparse graphs and should be avoided for dense graphs, as it yields little to no reduction in size.

Later, Akiba and Iwata [5] were the first to show the practicality of the branch-and-reduce paradigm for MVC (and MIS) compared to other state-of-the-art approaches like branch-and-bound and branch-and-cut. Their algorithm uses a wide spectrum of reduction rules that form the foundation of much subsequent work. This includes both conceptually simple reduction rules like degree-1 and degree-2 vertex folding [69], as well as more complicated but practically significant rules like unconfined [169] and an LP-based rule [95,133]. Many of these reduction rules work by removing vertices that are part of some MIS. We illustrate this by briefly covering the degree-1 and degree-2 vertex fold reduction rules (both are also sketched in code below): (1) In the degree-1 reduction rule (see Fig. 1) one removes vertices *v* of degree one (and their neighbors), as they are always in at least one MIS. To see this, note that *v* or its neighbor *w* must be in some MIS *I*, otherwise *I* ∪ {*v*} is an independent set of larger cardinality. If *w* is in *I*, one can obtain an independent set of the same size by removing *w* from *I* and adding *v* instead. (2) For the degree-2 vertex fold (see Fig. 2) one removes vertices *v* with exactly two neighbors *u* and *w* that are not adjacent to each other. In this case a new vertex *v*′ is inserted and connected to the union of the neighborhoods of *u* and *w*, yielding a reduction of the graph size by two vertices. Finally, if *v*′ is part of an MIS *I*′ of the reduced graph, then *I* = (*I*′ \ {*v*′}) ∪ {*u*, *w*} is an MIS of the original graph. Otherwise, *I* = *I*′ ∪ {*v*} is an MIS of the original graph. Using their branch-and-reduce algorithm, Akiba and Iwata were able to solve a large variety of instances including social networks, web graphs, and road networks. A similar approach that uses a quantum annealer to solve instances once they are small enough was recently presented by Pelofske et al. [140].

**Fig. 1.** Degree-1: Vertices *v* and *u* can be removed.

**Fig. 2.** Degree-2 vertex fold: Vertices *v*, *u* and *w* can be removed. In this case a new vertex *v*′ is inserted.
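The following sketch applies exactly these two rules exhaustively, recording degree-2 folds so that a solution of the reduced graph can be lifted back. It assumes integer vertex identifiers for creating the fold vertices, and all names are illustrative rather than taken from any of the solvers discussed here.

```python
import itertools

def reduce_mis(adj):
    """Exhaustively apply the degree-1 rule and the degree-2 vertex fold for
    Maximum Independent Set. adj: dict vertex -> set of neighbours (ints).
    Returns (reduced_adj, in_solution, folds)."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    in_solution, folds = set(), []
    fresh = itertools.count(max(adj, default=0) + 1)   # ids for fold vertices v'

    def delete(vertices):
        for v in vertices:
            for u in adj[v]:
                adj[u].discard(v)
            del adj[v]

    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v not in adj:
                continue
            nbrs = adj[v]
            if len(nbrs) == 0:                  # isolated: always in some MIS
                in_solution.add(v); delete({v}); changed = True
            elif len(nbrs) == 1:                # degree-1: take v, drop its neighbour
                in_solution.add(v); delete({v} | nbrs); changed = True
            elif len(nbrs) == 2:
                u, w = nbrs
                if w not in adj[u]:             # degree-2 fold (u and w non-adjacent)
                    vp = next(fresh)
                    new_nbrs = (adj[u] | adj[w]) - {u, v, w}
                    delete({u, v, w})
                    adj[vp] = set(new_nbrs)
                    for x in new_nbrs:
                        adj[x].add(vp)
                    folds.append((vp, v, u, w))
                    changed = True
    return adj, in_solution, folds

def lift(solution, in_solution, folds):
    """Turn an MIS of the reduced graph into an MIS of the original graph."""
    solution = set(solution) | in_solution
    for vp, v, u, w in reversed(folds):
        if vp in solution:
            solution -= {vp}; solution |= {u, w}   # v' chosen -> take u and w
        else:
            solution.add(v)                        # otherwise v itself is safe
    return solution
```

A call to `reduce_mis`, followed by any exact or heuristic solver on the reduced graph and a final `lift`, reproduces the preprocess-then-solve pattern used by the algorithms discussed in this section.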

Although Akiba and Iwata [5] use a sophisticated set of reduction rules, Strash [155] showed that many of the more complicated rules are not necessary to compute an MIS in many large complex networks. Furthermore, the initial reduction rules applied to compute a reduced graph often have a bigger impact on performance than further techniques used during the branch-and-reduce approach. Recently, Stallmann et al. [153] supported this idea by showing that networks *G* with a small normalized average degree (nad(*G*)) can be efficiently handled by simple reduction rules. The nad(*G*) of a network *G* on *n* vertices is defined as the average degree of *G* normalized by a factor of 200/*n* if the average degree is larger than 20; otherwise, nad(*G*) is simply the average degree of *G*. Additionally, the authors make use of the so-called degree spread *t*/*b*, where *t* is the degree at the 95th percentile and *b* the degree at the 5th percentile. Based on these characteristics, the authors devise thresholds that indicate (1) whether reductions should be used at all, and (2) whether more complex rules provide a significant benefit.
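A direct transcription of these two graph characteristics might look as follows (the exact percentile convention used in [153] may differ; this is an assumption of the sketch):

```python
import statistics

def nad_and_degree_spread(degrees):
    """Normalized average degree and degree spread as described above.
    `degrees` is the degree sequence of the network (length n >= 2)."""
    n = len(degrees)
    avg = sum(degrees) / n
    nad = avg * (200.0 / n) if avg > 20 else avg
    percentiles = statistics.quantiles(degrees, n=100, method="inclusive")
    t, b = percentiles[94], percentiles[4]      # 95th and 5th percentile degrees
    spread = t / b if b > 0 else float("inf")
    return nad, spread
```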

**Open Problem 1.** *What are graph characteristics and properties that determine the success of specific reduction rules?*

Recently, Hespe et al. [92] won the PACE Challenge 2019 vertex cover track by using a portfolio of exact approaches for MIS, MVC and maximum clique. In particular they use the reduction rules of Akiba and Iwata as an initial preprocessing step. Afterwards, an initial solution is computed using the state-of-the-art local search algorithm by Andrade et al. [7]. Finally, they switch between the branch-and-reduce algorithm of Akiba and Iwata [5] and the clique solver by Li et al. [119], which are applied to either the original graph or the graph resulting from the preprocessing step.

**Heuristic Approaches.** Reductions are also heavily used in many state-of-the-art heuristic approaches. Lamm et al. [114,113 SPP,150 SPP] use the same set of reductions originally used by Akiba and Iwata to develop an evolutionary algorithm that is able to compute high-quality solutions for large graphs that are infeasible for branch-and-reduce. The authors use reductions both for preprocessing (to compute a kernel) and during the algorithm itself. In particular, they select vertices that are part of many highly fit individuals (independent sets) in their population. These vertices are then added to the resulting independent set, which includes removing them and their neighbors from the graph. Afterwards, reduction rules are applied and the evolutionary algorithm is called recursively on the resulting graph.

The idea of fixing a subset of vertices that are likely to be part of a high-quality independent set is also explored by Gao et al. [73]. To select these vertices, they perform multiple runs of a state-of-the-art local search algorithm (either NuMVC [33] or FastVC [34]). Vertices that are present in all resulting solutions are then added to the final solution, and a new graph consisting of the remaining vertices and their corresponding edges is constructed. Afterwards, a final run of the local search on this graph is executed and its solution is combined with the previously removed vertices.

Dahlum et al. [51,150 SPP] combine simple exact reduction rules as well as inexact reductions with the ARW local search algorithm [7]. In particular, they remove cliques of up to size three (an exact reduction) and the top 1% of high-degree vertices (an inexact reduction). The reasoning behind their inexact reduction is that high-degree vertices are not likely to be in a large independent set. Additionally, these vertices pose a significant bottleneck for local search. The authors also compare their algorithm against an algorithm that uses the data reduction rules of Akiba and Iwata as a preprocessing step. A similar preprocessing approach that only uses a subset of reduction rules is also presented by Cai et al. [37]. In particular, they use the degree-0, degree-1, degree-2, and domination rules.

Chang et al. [42] also make use of the idea of combining simple reduction rules that can be applied in (near-)linear time with an inexact reduction rule that removes high-degree vertices. For this purpose, they introduce the reducing-peeling framework that switches between the two types of reductions. Furthermore, they present a set of degree-2 path reductions that are special cases of the folding reduction. Combining these new rules with the degree-0, degree-1, dominance and an LP-based reduction rule, they propose an efficient preprocessing algorithm that is then combined with the ARW local search algorithm.

**Open Problem 2.** *Can one derive (near-)linear time special cases of the more complex reductions like the unconfined reduction that are not covered by existing reductions?*

In order to quickly achieve smaller reduced graphs than what is possible with simple reduction rules alone, Hespe et al. [93,150 SPP] provided the first shared-memory data reduction based on the rules of Akiba and Iwata. For this purpose, they make use of both graph partitioning and parallel bipartite maximum matchings. The graph partitioning library KaHIP [148] is used to compute a partition of the graph, which allows parallel execution of reduction rules that only need to check highly localized subgraphs; bipartite maximum matchings are used to enable the parallel execution of the LP-based reduction rule. Furthermore, the authors present two speedup techniques for kernelization: (1) dependency checking that prunes applicability checks for certain reductions, and (2) reduction tracking that stops their algorithm once the application of reduction rules only decreases the graph size by a negligible amount.

**Open Problem 3.** *Can the techniques used by Hespe et al. [93] be extended to a distributed memory setting? How can one efficiently apply reductions in distributed memory?*

Alsahafy and Chang [6] recently proposed an algorithm that combines the reducing-peeling framework with the exact clique solver MoMC by Li et al. [119]. Their algorithm splits reduction rules into two sets: ones that can be updated and applied incrementally (similar to Hespe et al. [93]), and ones that cannot. Additionally, they continuously compute and maintain the connected components of the graph, which are then reduced individually. If a reduced component is small enough, it is transformed into its complement and solved by MoMC. To ensure that components continue to get smaller, they use the same inexact reduction rule as Chang et al. [42] and then continue recursively on the resulting components. The authors also present a new exact reduction rule called the pyramid reduction.

Lastly, Lavallee et al. [118] evaluated a structural rounding approach for vertex cover. The main idea is to first edit a graph to a well-structured graph which can be solved more easily, and then apply a "lifting" algorithm to the partial solution to recover an approximation on the input network. Lavallee et al. find that their algorithm can outperform standard 2-approximation algorithms and that simpler lifting strategies are highly competitive with more sophisticated strategies.

**Weighted MIS.** Due to the significant practical results achieved for the unweighted case, there has been an increasing interest in generalizing these techniques to the weighted *maximum independent set* (WMIS) and *weighted minimum vertex cover* (WMVC) problems. For both problems, one is given an additional real-valued vertex weighting function *w* : *V* → ℝ+. In the case of the WMIS problem, one is then tasked with finding an independent set such that the sum of the weights of its vertices is maximum among all possible independent sets. Analogously, for the WMVC problem one is tasked with finding a vertex cover of minimum weight.

Recently, Li et al. [120] used a set of four reduction rules during the initial construction phase of a local search algorithm. In particular, they use weighted reduction rules that are able to remove degree-one and degree-two vertices. They apply these reduction rules exhaustively at the beginning of their algorithm to obtain an improved initial solution. Their local search algorithm, called NuMWVC, is able to compute high-quality solutions on a large variety of instances. This includes many instances commonly used for the unweighted case, which have been given vertex weights drawn from a uniform distribution. Since there are not many publicly available weighted instances, this is a common approach that is also used in other works [35,77,172,115 SPP].
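To give a flavor of such weighted rules, the sketch below applies one standard weighted degree-1 reduction: a degree-1 vertex is taken if it is at least as heavy as its neighbor, otherwise its weight is folded into the neighbor. This is a generic textbook-style rule, not necessarily the exact rule set of [120], and all names are illustrative.

```python
def weighted_degree1_reduce(adj, weight):
    """One standard weighted degree-1 reduction for Maximum Weight Independent
    Set (illustrative sketch). Returns (adj, weight, forced, offset): `forced`
    vertices belong to some optimal solution, `offset` is added to the value of
    any solution of the reduced instance."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    weight = dict(weight)
    forced, offset = set(), 0

    def delete(v):
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]

    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if v not in adj:
                continue
            if len(adj[v]) == 0:                 # isolated vertices are always taken
                forced.add(v); offset += weight[v]; delete(v); changed = True
            elif len(adj[v]) == 1:
                u = next(iter(adj[v]))
                if weight[v] >= weight[u]:       # taking v is never worse than taking u
                    forced.add(v); offset += weight[v]
                    delete(u); delete(v); changed = True
                else:                            # fold: in the reduced graph, picking u
                    offset += weight[v]          # means "u instead of v", not picking it
                    weight[u] -= weight[v]       # means "v"
                    delete(v); changed = True
    return adj, weight, forced, offset
```

Lifting is straightforward: if the reduced neighbor ends up in the solution it is kept, otherwise the folded degree-1 vertex is taken instead.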

Wang et al. [165] also make use of reduction rules for vertices with degree at most two as a preprocessing step for a branch-and-bound solver. Furthermore, they evaluate different degree-based heuristics for selecting branching vertices and use pruning based on the best solution found so far.

Lamm et al. [115 SPP] proposed a practically efficient branch-and-reduce algorithm for the WMIS problem that is able to solve a large number of real-world instances. For this purpose, they develop a comprehensive set of practically efficient reduction rules. These include generalizations of previous weighted and unweighted reduction rules, as well as two "meta reductions" which serve as a general framework for the other rules. They use these rules to build a branch-and-reduce algorithm that adopts many of the approaches that worked well in the unweighted case. In particular, they use local searches to compute initial solutions which can be used for pruning, treat connected components individually, and make use of dependency checking. Finally, they show that their reduction rules can be used to improve the performance of other state-of-the-art algorithms.

Zheng et al. [172] propose an exact and a heuristic approach that both make use of reduction rules for vertices of degree at most two. Their exact approach is a branch-and-reduce algorithm that applies these reduction rules recursively. However, the authors do not provide any details on the bounds or branching strategies used during the algorithm. Their heuristic approach is inspired by the reducing-peeling framework of Chang et al. [42]. Thus, it exhaustively applies their reduction rules and subsequently removes high-degree vertices to extend the space of possible reductions.

Gellner et al. [77] proposed new practically efficient variants of the struction rule by Ebenegger et al. [59]. The struction is a reduction that can be applied to arbitrary vertices of a graph, but it comes at the cost of potentially increasing the overall number of vertices. Thus, the authors propose three new variants of the struction that aim to limit the number of newly created vertices. Furthermore, they derive practically efficient special cases of their reduction rules and use them as a preprocessing step in the branch-and-reduce solver of Lamm et al. [115 SPP]. The algorithm is able to produce the smallest-known reduced graphs, solves more instances than previous exact approaches, and has a running time that is comparable to heuristic algorithms.

**Open Problem 4.** *Can other problems also benefit from reductions that may temporarily increase the graph size? If so, how much of an increase should be allowed to remain practical?*

#### **2.2 Finding and Enumerating Maximum Cliques**

Given an undirected graph *G* = (*V*, *E*), the goal of the *maximum clique* (MC) problem is to compute a set of vertices *C* ⊆ *V* such that (1) all vertices in *C* are adjacent to one another, and (2) the set *C* has maximum cardinality among all such sets. As mentioned in the previous section, MC solvers are often used in the context of independent sets. This is due to the fact that a clique of *G* is an independent set in the complement graph Ḡ = (*V*, Ē) with Ē = {{*u*, *v*} | *u*, *v* ∈ *V*, *u* ≠ *v*, {*u*, *v*} ∉ *E*}. Thus, one can leverage maximum clique algorithms for finding independent sets by computing the complement graph. Since many algorithms for finding maximum independent sets aim to perform well on sparse graphs, the resulting complement graphs that need to be handled by clique algorithms will often be dense. Fortunately, MC has been studied more extensively for dense instances than for sparse instances. Like MIS and MVC, finding a maximum clique is an NP-hard optimization problem [75]. Furthermore, unless P = NP, there is no polynomial time algorithm that approximates the MC size within a factor of *O*(*n*^(1−ε)) for any constant ε > 0 [173]. Finally, MC is *W*[1]-hard [56] when parameterized by solution size *k*, making it unlikely that the problem is fixed-parameter tractable in *k*. However, it is fixed-parameter tractable under different parameterizations, e.g., when parameterized by the degeneracy of the graph [64]. All previous observations also hold for the *maximum clique enumeration* (MCE) problem of enumerating all maximum cliques in a graph.
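The reduction between the two problems is essentially a one-liner (a sketch; note that, as pointed out above, the complement of a sparse graph is dense):

```python
def complement(adj):
    """Complement graph: a maximum clique in `adj` corresponds to a maximum
    independent set in the returned graph, and vice versa."""
    vertices = set(adj)
    return {v: (vertices - adj[v]) - {v} for v in vertices}
```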

Eblen et al. [60] present a maximum clique solver (MCF) that adapts some of the reduction rules that have already been shown to work well for MVC and MIS. In particular, their algorithm begins by greedily computing a large clique *C*, which is then used as a lower bound in order to remove vertices of degree less than |*C*| − 1 [1]. Next, they use an adaptation of the degree-0 reduction rule previously used in MVC algorithms, as well as a rule based on heuristic colorings [160], to remove additional vertices. The authors also investigate the use of other reduction rules, including an adaptation of the degree-1 reduction rule used in MVC algorithms. Finally, they compare applying reduction rules as a preprocessing method for a branch-and-bound solver against running them in a branch-and-reduce solver. Their experiments indicate that the branch-and-reduce approach performs better on real-world genome data.
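A minimal version of this first step, greedily growing a clique and then peeling all vertices whose degree is too small to participate in a larger clique, could look as follows (illustrative names; the peeling is essentially a (|*C*|−1)-core computation):

```python
def greedy_clique(adj):
    """Grow a clique greedily, always picking the candidate with the most
    neighbours among the remaining candidates; its size is a lower bound."""
    clique = []
    candidates = set(adj)
    while candidates:
        v = max(candidates, key=lambda x: len(adj[x] & candidates))
        clique.append(v)
        candidates &= adj[v]          # only common neighbours stay candidates
    return clique

def peel_by_clique_bound(adj, lb):
    """Iteratively remove vertices of degree < lb - 1; no member of a clique
    of size >= lb can ever be removed this way."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    queue = [v for v in adj if len(adj[v]) < lb - 1]
    while queue:
        v = queue.pop()
        if v not in adj:
            continue
        for u in adj[v]:
            adj[u].discard(v)
            if len(adj[u]) < lb - 1:
                queue.append(u)
        del adj[v]
    return adj
```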

Eblen et al. [60] then use the MCF solver to develop several approaches for the maximum clique enumeration (MCE) problem based on the algorithm by Bron and Kerbosch [30]. In particular, they develop two reduction rules based on MCF: First, they propose a reduction rule that uses MCF to compute a maximum clique cover and removes vertices not adjacent to this cover. Second, they propose a data-driven preprocessing rule that computes so-called essential vertices, i.e., vertices that are present in every maximum clique. Vertices that are not adjacent to these essential vertices are subsequently removed from the graph. Their experiments indicate that this rule works particularly well on large transcriptomic graphs, which often have a small set of essential vertices. However, its performance degrades for networks that do not have a small set of essential vertices, e.g., for uniform random graphs.

**Open Problem 5.** *Can one give similar data-driven reduction rules for other types of networks, e.g., social networks or road networks?*

Verma et al. [164] propose another type of reduction rule based on *k*-communities. A *k*-community is defined as a subgraph *G*′ = (*V*′, *E*′) in which each edge {*u*, *v*} ∈ *E*′ connects vertices that have at least *k* common neighbors in *G*′. Subsequently, a subset of vertices *V*′ ⊆ *V* is called a *k*-community if there is a *k*-community with vertex set *V*′ in *G*. Note that a clique of size *k* is a (*k*−*t*)-community for any *t* ∈ {2, ..., *k*}. They then derive a reduction rule which computes a lower bound on the clique size based on maximum (*k*−2)-communities and prunes vertices of smaller degree. They then combine this reduction rule with the *k*-core based approach of Pardalos and Resende [1] and show that the resulting algorithm works well for handling large low-density graphs.

Chang [40,41] notes that even though many real-world networks are sparse, MC has been studied more extensively for dense instances. Thus, the author proposes a branch-and-reduce algorithm that leverages the existing work on MC for dense instances by transforming an instance of MC over a sparse graph into instances of *k*-clique finding (KCF) over dense subgraphs. For this purpose, the algorithm iteratively computes small and dense subgraphs (so-called ego networks) that are then handled by a KCF solver. In order to reduce the size of the subgraphs that are handled by this solver, the algorithm uses a combination of well-known upper bounds and lightweight reduction rules. In particular, it uses five reduction rules for KCF, most of which are targeted toward removing vertices of high degree. The author also presents a heuristic algorithm for MC, as well as a two-stage approach for MCE that makes use of the exact algorithm to compute the size of the largest clique. Furthermore, it is shown that the reduction rules used for MC can also be adapted for MCE.

**Weighted MC.** Recently, Cai and Lin [36] proposed the first (and only) practical algorithm for the *(vertex-)weighted maximum clique* (WMC) problem that uses reduction rules. The WMC problem is a generalization of MC where one is given an additional real-valued vertex weighting function *w* : *V* → ℝ+. Subsequently, one is tasked with finding a clique, such that the sum of the weights of its vertices is maximal among all possible cliques. In order to solve WMC on large sparse graphs, Cai and Lin [36] interleave clique construction with reduction rules. To be more specific, they gradually add "beneficial" vertices to a clique using an approximation of the benefit of a vertex. This is done by computing the mean of a cost-efficient upper and lower bound for each vertex and then selecting vertices using a dynamic best from multiple selection [34]. Finally, if a new best clique is found, the graph is reduced using two reduction rules. Both rules make use of the fact that one is able to remove vertices where an upper bound on any maximum clique containing this vertex is smaller than the weight of the current best clique. For their rules, the authors then propose two different upper bounds that make use of the neighborhood of a vertex.

*k***-plexes.** A *k*-plex is a generalization of a clique where each vertex is allowed to have several missing connections, i.e., not every vertex has to be connected to all other vertices in the *k*-plex [151]. In particular, a *k*-plex is a subset *S* ⊆ *V* such that the degree of every vertex in the induced subgraph *G*[*S*] is at least |*S*| −*k*. Furthermore, |*S*| is called the size of the *k*-plex and the *maximum k-plex problem* (MK) is that of finding a *k*-plex of maximum size.

Gao et al. [72] present multiple theoretical properties that allow the removal of vertices based on a lower bound on the maximum *k*-plex size. Based on these properties they propose four reduction procedures, which are then used in a branch-and-reduce algorithm. In particular, they use an extension of the algorithm by Jiang et al. [100] to compute an initial lower bound, which is then used to exhaustively apply their linear-time vertex reduction and their more costly subgraph reduction rules during preprocessing. Afterwards, they use different sets of reduction rules depending on the type of branch (selecting or discarding a vertex). The authors also present a type of targeted branching that aims to select vertices which lead to a larger reduction in size. The resulting algorithm is able to solve multiple previously infeasible real-world instances and is considerably faster than previous state-of-the-art solvers (e.g., [168]).

**Open Problem 6.** *Can targeted branching be used for other problems? For example, the most commonly used branching strategy for independent sets is degree-based and does not take any reduction rules into account.*

Conte et al. [46] investigated reduction rules for the problem of enumerating all maximum *k*-plexes. For this purpose, they introduce the concepts of coreness and cliqueness. Coreness states that vertices of a *k*-plex of size at least *m* must have a degree not smaller than *m* − *k*. Thus, vertices with a smaller degree can iteratively be removed, resulting in the computation of (*m*−*k*)-cores. Cliqueness states that every vertex of a *k*-plex of size at least *m* is part of a clique of size not smaller than ⌈*m*/*k*⌉. Therefore, vertices with a degree less than ⌈*m*/*k*⌉ can be removed from the graph. Furthermore, if one knows the size ω of the maximum clique, the search space for the size of the maximum *k*-plex can be limited to [ω, ω·*k*]. Based on these observations, the authors present an algorithm that begins by computing the size of a maximum clique. Afterwards, a lower bound *p* ∈ [ω, ω·*k*] for the size of the maximum *k*-plex is guessed. If this guess turns out to be wrong (i.e., all *k*-plexes found are smaller than *p*), the interval bounds are updated and a new lower bound is guessed. Otherwise, all *k*-plexes of maximum size are returned. Their algorithm is able to reduce a large set of instances by up to 99% and achieves running times that are multiple orders of magnitude faster than previous approaches [14].

#### **2.3 Maximum Cuts**

The *max-cut* problem originates from important applications in physics and operations research [10]; therefore, it has long been the subject of engineering ever more sophisticated algorithms which solve large-scale instances arising in practice. In particular, max-cut is one of the few problems where engineers and practitioners alike are interested in finding optimal solutions (rather than just approximate ones). Formally, the max-cut problem takes as input an edge-weighted graph *G* and seeks a bipartition of the vertex set *V* of *G* into two disjoint parts, *V*1 and *V*2, which maximizes the weight of the edges which *cross* the bipartition, that is, edges with one endpoint in *V*1 and the other endpoint in *V*2. The state of the art for max-cut, though, is that even after much effort, optimal solutions are still unknown for several benchmark instances. Those reasons are the key motivations for engineering effective and efficient kernelization rules. The objective is to reduce the given graph *G* to a new instance *G*′ of smaller size, such that a maximum cut in *G* can be recovered efficiently from any maximum cut in *G*′.

To the best of our knowledge, preprocessing rules with theoretical guarantees have so far been studied mainly for unit-weight max-cut. That special case of max-cut, where all edges have the same (unit) weight, is still NP-hard. The goal is thus to find a bipartition (*V*1, *V*2) which maximizes the size of the cut, which is the number of edges with one endpoint in *V*1 and the other endpoint in *V*2. To measure the effectiveness of preprocessing rules for unit-weight max-cut, one introduces an integer parameter *k*. This parameter measures the difference between the size of the maximum cut and the value *m*/2 + (*n* − 1)/4, which is the well-known lower bound on the size of the maximum cut in any connected *m*-edge *n*-vertex graph, due to Edwards and Erdős [61,62]. There is a set of preprocessing rules, devised by Etscheid and Mnich [66 SPP], which compresses any *m*-edge *n*-vertex graph *G* in linear time to a graph *G*′ on just *O*(*k*) vertices, while allowing to recover the maximum cut of *G*. This set of rules strengthened earlier work by Crowston et al. [47 SPP], and it is moreover asymptotically best possible. To understand the practical relevance of those rules, Ferizovic et al. [68 SPP] expanded and engineered them. They demonstrated their significant impact on benchmark data sets, including synthetic instances and data sets from the VLSI and image segmentation application domains. Their experiments revealed that current state-of-the-art solvers can be sped up by up to multiple orders of magnitude when combined with their data reduction rules. On social and biological networks in particular, the preprocessing enabled them to solve four instances that were previously unsolved within a ten-hour time limit with state-of-the-art solvers; three of these instances are now solved in less than two seconds.

It is possible to expand the work on preprocessing for unit-weight max-cut to instances with all positive weights. However, designing practically efficient preprocessing rules for the general max-cut problem, which also provide theoretical guarantees on the kernel size, remains a challenge. Recent work in this direction was done by Lange et al. [116], who designed reduction rules for general max-cut. They showed the efficacy of their rules on instances from computer vision, biomedical image analysis, and statistical physics, and for those instances managed to obtain substantial size reductions.

**Open Problem 7.** *Is it possible to engineer efficient reduction techniques for max-cut with general edge weights?*

#### **2.4 Treewidth and Treedepth**

Many NP-hard graph problems can be solved efficiently when the input graph is a tree. A tree decomposition maps the vertices of a graph to bags arranged in a tree, which allows techniques for trees, especially dynamic programming, to be adapted to arbitrary graphs. However, the quality of the tree decomposition impacts the efficiency of such algorithms. *Treewidth* [146] is one measure of this quality; it has been studied extensively in the parameterized algorithms literature, which we now describe.

Formally, a *tree decomposition* of a graph *G* = (*V*,*E*) is a family 𝒳 ⊆ 2^V \ {∅} of subsets of *V* called bags, together with a tree *T* = (𝒳,*F*), such that

- every vertex of *V* is contained in at least one bag,
- for every edge {*u*,*v*} ∈ *E* there is a bag containing both *u* and *v*, and
- for every vertex *v* ∈ *V*, the bags containing *v* induce a connected subtree of *T*.

The *width* of a tree decomposition of *G* is one less than the cardinality of its largest bag, that is, max_{X∈𝒳} |X| − 1. The treewidth of *G*, denoted tw(*G*), is the minimum width over all tree decompositions of *G*.
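To illustrate these definitions, the following small Python sketch (assuming bags are given as Python sets) computes the width of a given decomposition and checks the edge condition; the connectivity condition is omitted for brevity.

```python
def decomposition_width(bags):
    """Width of a tree decomposition: size of the largest bag minus one."""
    return max(len(bag) for bag in bags) - 1

def covers_all_edges(bags, edges):
    """Sanity check for one of the conditions: every edge lies inside some bag."""
    return all(any(u in bag and v in bag for bag in bags) for (u, v) in edges)
```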

Unsurprisingly, computing tw(*G*) is NP-hard, and deciding whether tw(*G*) ≤ *k* for some positive integer *k* is NP-complete. This *treewidth* problem is a canonical problem with many theoretical and practical results in the literature. It is fixed-parameter tractable with running time 2^{O(k^3)} · n [21], implying that it has a kernel of size exponential in k^3 [32]. The problem does not have a kernel of size subexponential in k unless NP ⊆ coNP/poly [22]. Hence, most work focuses on constructing tree decompositions of small width, either approximately [23] or exactly, using methods such as positive-instance driven dynamic programming [156]. Both the first and second PACE Challenges had a treewidth track [55]. However, polynomial kernels exist for other parameters. Bodlaender et al. [25] give polynomial kernels of size O(fvs(G)^4) and O(vc(G)^3), where fvs(G) is the size of a minimum feedback vertex set and vc(G) the size of a minimum vertex cover of *G*, respectively. Their work is inspired by data reduction rules that are known to work well in practice (discussed below), and also includes new rules based on the notion of "clique-seeing" paths. Jansen [98] improved the latter kernel to size O(vc(G)^2) by introducing a new reduction rule to efficiently find independent sets whose elimination has a predictable effect on the treewidth. To the best of our knowledge, no experiments have been done with clique-seeing paths or Jansen's reduction.

**Open Problem 8.** *Is the rule of Jansen [98] effective in practice?*

Much work has been done on making data reductions for the treewidth problem practical. In early work, Arnborg and Proskurowski [8] introduced reduction rules for recognizing and characterizing partial 3-trees. Bodlaender et al. [27] categorized these reductions into six types (islet, twig, series, triangle, buddy, and cube) and extended these rules, showing them to be highly effective at reducing graph size in practice [27]. Of note here are two variations of well-known reductions from other problems: simplicial vertices and twins of degree 3. They further give a reduction for *almost* simplicial vertices (vertices with all but one neighbor inducing a clique). On graphs with up to 3 032 vertices, the reductions quickly remove 77% of vertices on average, whereas the simplicial vertex reduction alone removes 51% of vertices on average. The worst-performing instances had 30% of their vertices removed. Van den Eijkhof et al. [63] generalized many of these reduction rules. They not only introduce new weighted variants, but generalize most previous reductions with a "contraction" reduction rule, and further introduce a reduction for twins of higher degree.
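The simplicial vertex rule mentioned above can be sketched in a few lines. The following Python fragment (adjacency sets, quadratic neighborhood test) removes simplicial vertices exhaustively and records a lower bound on the treewidth; it is meant only to illustrate the rule, not the tuned implementations discussed here.

```python
def simplicial_reduction(adj):
    """Repeatedly remove simplicial vertices (vertices whose neighborhood is a clique).

    adj: dict vertex -> set of neighbors.  Returns the reduced graph and a value
    'low' with tw(G) = max(low, tw(reduced graph)), so 'low' is a treewidth lower bound.
    """
    adj = {v: set(n) for v, n in adj.items()}
    low = 0
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            nbrs = adj[v]
            if all(u in adj[w] for u in nbrs for w in nbrs if u != w):
                low = max(low, len(nbrs))        # neighborhood is a clique of this size
                for u in nbrs:
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return adj, low
```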

Later, Bodlaender et al. [26] introduced the concept of a safe separator, which decomposes the graph into subgraphs that can be solved independently. It was already known that clique separators were safe [136]; however, they generalize the concept and introduce other easy-to-find separators. They further show that previous reduction rules are subsumed by their safe separator technique. In experiments, their reductions decomposed 33 out of 40 instances. When run as a preprocessing step, their technique speeds up an existing triangulation heuristic, sometimes by multiple orders of magnitude. However, it only gives modest speedups over preprocessing using existing reductions.

**Open Problem 9.** *How effective are existing treewidth reductions on large sparse graphs (e.g., with millions of vertices) in practice?*

**Open Problem 10.** *Can heuristic methods be used to efficiently find safe separators in practice?*

A related concept exists for decompositions into rooted trees. A *treedepth decomposition* of a graph *G* = (*V*,*E*) is a rooted forest *F*, together with an injective mapping φ : *V*(*G*) → *V*(*F*) such that, for each edge {*u*,*v*} ∈ *E*, one of φ(*u*) and φ(*v*) is an ancestor of the other. The treedepth of *G*, denoted by td(*G*), is the minimum height of any treedepth decomposition of *G*. The *treedepth* problem, deciding whether td(*G*) ≤ *k* for some positive integer *k*, is NP-complete [142].

Many similar results exist for the treewidth and treedepth problems. Reidl et al. [145] give a fixed-parameter tractable algorithm for treedepth *k*, with running time 2^{O(k^2)} · n, implying the existence of a kernel of size exponential in k^2; no subexponential kernel exists unless NP ⊆ coNP/poly [22]. However, when parameterized by the vertex cover number vc(G), the problem has a kernel of size O(vc(G)^3) [109], which is achieved through two simple reduction rules that also apply to treewidth: removing simplicial vertices and adding edges between vertices with at least *k* common neighbors.

However, as far as we are aware, there are significantly fewer experimental works with data reduction rules for treedepth. The 5th PACE Challenge in 2020 was dedicated to exact and heuristic solutions for treedepth. The winning solver by Trimble [161] did not employ any data reduction rules (instead using symmetry breaking together with a variety of lower bounding techniques); however, the second-place solver by Korhonen [112] applies the simplicial vertex rule by Kobayashi and Tamaki [109] and a generalization of their common neighbor rule. Korhonen further introduces a new reduction rule based on Schäffer's linear-time algorithm [149] for computing the treedepth of trees. This rule replaces a tree subgraph *G*[*T*] having |N(V \ T)| = 1 with a subgraph of size td(*G*[*T*])^2. As far as we know, there are no published results on the efficacy of these reduction rules. Of further interest is that this algorithm uses minimal separator enumeration. We conclude with the following open problems.

**Open Problem 11.** *How effective are the reductions of Kobayashi and Tamaki [109] and Korhonen [112] in practice?*

**Open Problem 12.** *Does the notion of a safe separator extend to the treedepth problem?*

#### **2.5 Hitting Set**

Given a set *S* along with a collection *C* of its subsets, the *hitting set* problem asks for a subset of *S*, of minimum cardinality, that has a non-empty intersection with each and every member of *C*. Hitting set is the dual of *set cover*, which seeks a minimum-cardinality subset of *C* whose union is *S*. If the elements of *S* and *C* are treated as red and blue vertices, respectively, of a bipartite graph, the equivalent graph-theoretic problem is known as *red-blue dominating set* (RBDS).

Hitting set is NP-hard, and *W*[2]-hard when parameterized by the solution size [56]. It becomes fixed-parameter tractable when each member of *C* has size bounded by a constant *d*. In this case the problem is often referred to as *d*-hitting set, and it corresponds to RBDS restricted to (red-blue) graphs where each red vertex has at most *d* neighbors. The problem is also known to be fixed-parameter tractable when parameterized by |*C*|, but this particular parameter is expected to be large in practice. The most popular reductions for hitting set are due to Weihe [166]. They are based simply on removing redundant elements from *S* and redundant members from *C*. In this context, an element of *S* is redundant if every member of *C* that contains it also contains some other fixed element; a member of *C* is redundant if it is a superset of another member of *C*. The application of these two rules alone proved to be highly effective on large public transportation networks, resulting in a huge reduction in size, as pointed out recently by Bläsius et al. [15].
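The two rules of Weihe can be stated very compactly. The following unoptimized Python sketch (sets of sets as input, quadratic scans) applies both rules exhaustively; the engineered implementations evaluated by Bläsius et al. [15] are considerably more efficient.

```python
def weihe_reductions(collection):
    """Weihe's two domination rules for hitting set (simple quadratic sketch).

    collection: iterable of sets over some universe S.  Returns the reduced collection.
    """
    sets = [set(s) for s in collection]
    changed = True
    while changed:
        changed = False
        # Rule 1: a member of C that is a superset of another member is redundant.
        sets.sort(key=len)
        kept = []
        for s in sets:
            if any(t <= s for t in kept):
                changed = True
            else:
                kept.append(s)
        sets = kept
        # Rule 2: an element e is redundant if some other element f is contained
        # in every member of C that contains e; e can then be dropped everywhere.
        universe = set().union(*sets) if sets else set()
        for e in universe:
            containing = [s for s in sets if e in s]
            dominating = set.intersection(*containing) - {e} if containing else set()
            if dominating:
                for s in containing:
                    s.discard(e)
                changed = True
    return sets
```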

More sophisticated reduction algorithms appeared in the context of kernelization for *d*-hitting set [2,129,134]. The kernelization approach of Abu-Khzam [2] was adopted by Mellor et al. [125] and proved to be effective in the context of multiple drug selection for cancer therapy. Moreover, linear-time algorithms that can obtain a kernel of size O(k^d) were presented by van Bevern [162] and Fafianie and Kratsch [67]. Practical implementations of these algorithms have been addressed recently by van Bevern and Smirnov [163], where they were shown to be more efficient than the reduction procedure of Weihe [166] for small *d* (up to 5), and they can result in even more effective data reduction when combined with the reduction rules of Weihe [166].

#### **2.6 Steiner Trees**

Given an undirected graph with non-negative edge weights as well as a subset of the vertices (terminals), the *Steiner tree* problem is to find the lightest tree spanning the terminals. There has been a wide range of implementations tackling the Steiner tree problem. Data reductions have long been used for the problem, see, e.g., Polzin [141] or Daneshmand [52]. Daneshmand [52] in particular has shown already in 2004 that many Steiner tree problem instances can be solved by reduction- and heuristic-based approaches.

Recently there have been two implementation challenges, the 11th DIMACS Implementation Challenge in 2014 and the 3rd PACE Challenge [28] in 2018. Here, we focus on the most successful implementations of the 3rd PACE Challenge and the approaches that have been published afterwards. The PACE Challenge had three tracks overall – two exact tracks with one focusing on algorithms for problems with few terminals and one focusing on problems with low treewidth, as well as one heuristic track.

The implementation of Iwata and Shigemura [96] won the track with problems that have few terminals. Their algorithm is based on the dynamic programming formulation of Erickson-Monma-Veinott [65], which has a theoretical running time of O(3^t · n + 2^t (n log n + m)), with *t* being the number of terminals. Iwata and Shigemura use a novel separator-based pruning technique to speed up their implementation (while keeping the worst-case bound of Erickson-Monma-Veinott). This technique allows them to prune a large number of entries in the dynamic programming table.

The track with problems that have low treewidth was won by SCIP-Jack [143,144] due to Koch and Rehfeldt. This approach is based on the branch-and-cut principle and was already very successful during the 11th DIMACS Implementation Challenge. For the PACE Challenge, the authors use data reductions that typically reduce the number of edges in the problems by more than 90%. Many instances can already be solved completely by presolving. Moreover, on the remaining instances that cannot be presolved, the authors use heuristics to find strong upper and lower bounds quickly. The authors find that in more than 90% of the cases the heuristic already finds the optimum solution on the instances that have not been presolved. Lastly, the branch-and-cut procedure is used to compute lower bounds and prove optimality. Later, the approach was improved [152] to run in distributed memory and thus, by using up to 43 000 cores, managed to solve additional previously unsolved instances and to improve on the previously best known solutions.

**Open Problem 13.** *Are there new reductions that have not yet been tried in practice that could help to solve more instances to optimality in practice?*

**Open Problem 14.** *Can existing reductions for the standard Steiner tree problem be transferred to the more general multi-level Steiner tree problem?*

#### **2.7 Minimum Fill-In**

The *minimum fill-in* problem plays a critical role in accelerating Gaussian elimination when solving sparse linear systems [147]. Given a matrix *A* representing the sparse linear system *Ax* = *b*, the goal is to find a permutation matrix *P* that minimizes the number of non-zeros introduced when factorizing the permuted matrix PAP^T. Equivalently, treating *A* as the adjacency matrix of a graph *G* = (*V*,*E*), we wish to minimize the number of edges introduced in an *elimination ordering*, defined as follows. An *elimination step* removes a vertex *v* ∈ *V* and its incident edges, and adds edges between non-adjacent vertices in N_G(v), producing an elimination graph G_v. An *elimination ordering* of *G* is a permutation v_1 v_2 ... v_n of all the vertices in *G*, and the *fill-in* of the ordering is the number of edges introduced by eliminating the vertices v_1, v_2, ..., v_n in this order. The minimum fill-in is the smallest fill-in given by any elimination ordering. We are often interested not just in computing the minimum fill-in, but also in an elimination ordering that attains it.
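The following Python sketch makes the definition concrete by computing the fill-in of a given elimination ordering; the adjacency-set representation is an assumption made for this example.

```python
def fill_in(adj, ordering):
    """Number of fill edges introduced by eliminating vertices in the given order.

    adj: dict vertex -> set of neighbors (undirected); ordering: list of all vertices.
    """
    adj = {v: set(n) for v, n in adj.items()}
    fill = 0
    for v in ordering:
        nbrs = list(adj[v])
        for i, u in enumerate(nbrs):
            for w in nbrs[i + 1:]:
                if w not in adj[u]:          # non-adjacent neighbors get a fill edge
                    adj[u].add(w)
                    adj[w].add(u)
                    fill += 1
        for u in nbrs:                       # remove v and its incident edges
            adj[u].discard(v)
        del adj[v]
    return fill
```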

Not only is the minimum fill-in NP-hard to compute [171], but also no polynomial-time approximation scheme exists for the problem unless P = NP [39]. However, the problem is fixed-parameter tractable [103] when the parameter *k* is the minimum fill-in. The fastest known fixed-parameter algorithm for the problem is due to Fomin and Villanger [71], with running time 2^{O(√k log k)} + O(k^2 nm), where the additive O(k^2 nm) term is the time to compute a kernel with O(k^3) vertices [102]. Note that this algorithm is subexponential in the minimum fill-in *k* and, moreover, is nearly optimal: Cao and Sandeep [39] showed that no algorithm with running time 2^{O(k^{1/2−δ})} · n^{O(1)} exists for any positive constant δ, assuming the exponential time hypothesis holds. The smallest known kernel for the problem, due to Natanzon et al. [132], has 2k^2 + 4k vertices. The reductions all have the same flavor and are derived for the equivalent problem of *chordal completion*: finding the minimum number of edges to add to the graph so that it becomes chordal. Kernelization is done by partitioning the vertices into two sets *A* and *B*, where *B* induces a chordal graph and *A* contains vertices from every chordless cycle in *G*. The set *A* is formed by repeatedly finding chordless cycles in *G*[*B*] via the MCS algorithm [157,158] and moving a subset of their vertices to *A* until *G*[*B*] is chordal. Then *essential edges* are added to the chordless cycles induced by *A*, which yields the kernel.

In practice, the minimum fill-in problem is extremely hard to solve exactly. Indeed, in the 2nd PACE Challenge in 2017, the winning solver for the minimum fill-in problem solved only 54 out of 100 instances [55], with each instance given a 30-minute time limit. The top three submissions all used kernelization [102] together with dynamic programming over potential maximal cliques [29,156]. The first-place submission by Kobayashi and Tamaki used generalized variants of the data reduction rules of Bodlaender et al. [24], and the third-place submission performed preprocessing adapted from the safe separator technique for treewidth [26] in addition to kernelization [102].

However, heuristics, including nested dissection [78] and minimum-degree ordering [159], work quite well in practice for real-world (typically sparse) graphs. Early researchers noted that indistinguishable vertices may be eliminated together, and may therefore be collapsed into a representative vertex while ordering [9,57]. This reduction speeds up the minimum degree algorithm by more than a factor of two in experiments [79]. Ost et al. [137 SPP] recently introduced new data reduction rules based on twins, simplicial vertices, and path compression, and experiments show that they are highly effective in practice when applied before running nested dissection. For road networks, when used as a preprocessing step together with other inexact reductions, their techniques give speedups of between 1.79 and 6.37 over nested dissection while simultaneously reducing the fill-in. On social networks, their reductions yield speedups of between 1.72 and 3.92 on 19 out of 21 social networks tested, and the fill-in was reduced on all but one instance.

**Open Problem 15.** *How effective are the reductions by Ost et al. [137 SPP] when combined with other reductions [132]?*

**Open Problem 16.** *Is branch-and-reduce feasible for the minimum fill-in problem?*

#### **2.8 Vertex Coloring**

Given an unweighted, undirected simple graph *G* = (*V,E*), the *q-coloring* problem asks if there exists an assignment of at most *q* colors to all vertices in *V* such that no two adjacent vertices have the same color (i.e., a *proper coloring*). The problem of finding the minimum number χ(*G*) of colors for which a proper coloring of *G* exists is known as the *chromatic number* problem.

These problems have received considerable attention from the parameterized algorithms community; however, somewhat surprisingly, there is a wide divide between theory and practice. In theory, a kernel parameterized only by the number of colors is unlikely: since graph coloring is NP-hard for *q* = 3 colors [74], such a kernel would have constant size, implying P = NP. Therefore, research has focused on other parameters.

When considering the treewidth tw(G) of the graph *G*, if *G* is given together with a tree decomposition of width k ≥ tw(G), dynamic programming over the tree decomposition solves *q*-coloring in time q^k · k^{O(1)} · n [49, Theorem 7.9]. Assuming the Strong Exponential Time Hypothesis (SETH), no algorithm with running time (q − ε)^{tw(G)} · n^{O(1)} exists [122] for any ε > 0. Using the same technique, the chromatic number can be computed in time k^{O(k)} · n [49, Theorem 7.10]. Since these algorithms are fixed-parameter algorithms, the result due to Cai et al. [32] implies that kernels of size q^k · k^{O(1)} and k^{O(k)} exist for *q*-coloring and chromatic number, respectively. Treewidth is often small for sparse graphs in practice; however, as far as we know, these techniques have not been tried in practice, leading to the following open problem.

**Open Problem 17.** *How effective is dynamic programming over a tree decomposition for q-coloring (or chromatic number) on sparse graphs in practice?*

Another parameter of interest is the size of a minimum vertex cover. Recently, Jansen and Pieterse [99] gave a kernel parameterized by the number q ≥ 3 of colors and the size *k* of a minimum vertex cover, having size O(k^{q−1} log k) bits, which is optimal up to a factor of k^{O(1)} [97]. Their result also applies for a tighter parameter, when *k* is the size of a twin cover. Their technique uses constraint satisfaction with low-degree polynomials. However, in practice, sparse graphs often have a minimum vertex cover whose size is linear in the number of vertices. Thus, to be useful in practice, the actual kernel would need to be significantly smaller. However, to date no one has tested their method in practice, leading to our next open problem for *q*-coloring.

**Open Problem 18.** *How effective are the reductions of Jansen and Pieterse [99] in practice?*

The data reductions that have been implemented in practice are simple and come without theoretical guarantees on the size of the reduced graph; however, they are also very effective on large sparse graphs. In particular, in experiments for a branch-and-cut algorithm, Méndez-Díaz and Zabala [126] first preprocess the input graph by computing a large maximal clique *K* of *k* vertices, which is a lower bound on the chromatic number. They then iteratively remove each vertex *v* of degree at most *k*−1 (resulting in a *k*-core), which is possible since χ(*G*) = χ(*G* − {*v*}). They further give a rule to remove certain vertices with non-neighbors in *K*. In experiments on 63 graphs of up to 5 231 vertices from the second DIMACS Implementation Challenge<sup>1</sup>, their data reductions reduced all graphs by between 1% and 93%, working best on sparse instances. 36 of the 63 instances were reduced by at least 25%, and 21 instances were reduced by at least 50%. The largest percentage reduction was 93% for the homer instance, which was reduced from 561 to 38 vertices.
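A minimal Python sketch of this preprocessing idea is given below: a greedy maximal clique provides the lower bound *k*, after which low-degree vertices are peeled. This only illustrates the idea and is not the branch-and-cut preprocessing of Méndez-Díaz and Zabala [126].

```python
def clique_and_peel(adj):
    """Greedy maximal clique K as a lower bound |K| on chi(G), then repeatedly
    delete vertices of degree at most |K| - 1: such a vertex can always be
    colored last, so chi(G) is unchanged.  adj: dict vertex -> set of neighbors."""
    adj = {v: set(n) for v, n in adj.items()}
    start = max(adj, key=lambda v: len(adj[v]))
    clique = {start}
    for v in sorted(adj[start], key=lambda u: len(adj[u]), reverse=True):
        if clique <= adj[v]:                       # v is adjacent to the whole clique
            clique.add(v)
    k = len(clique)
    changed = True
    while changed:
        changed = False
        for v in [u for u, n in adj.items() if len(n) <= k - 1]:
            for u in adj[v]:
                adj[u].discard(v)
            del adj[v]
            changed = True
    return adj, k
```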

Verma et al. [164] extend this technique. They first compute lower and upper bounds for the chromatic number, and then iteratively apply the *k*-core reduction to heuristically color graphs for decreasing values of *k*. Their key contribution is starting with an exact coloring of the *k*-core, which gives a better bound than an initial clique. With this technique they are able to find the chromatic number exactly for very large sparse graphs with up to millions of vertices, with running times varying from seconds to hours. In total they solve 33 of 53 instances from SNAP<sup>2</sup> and the tenth DIMACS Implementation Challenge<sup>3</sup>. Lin et al. [121] extended the low-degree reduction to remove entire independent sets of vertices with low degree, which in some cases is orders of magnitude faster than the algorithm of Verma et al. [164]. However, they are not able to solve any additional instances.

We finally note that a crown reduction exists for the *dual coloring* problem, which asks if the graph has an (*n* − *k*)-coloring [70]. Crown reductions are particularly effective in practice for other problems, specifically the minimum vertex cover problem. In theory, for dual coloring, the crown reduction produces a kernel of size at most 3*k* − 3 [70, Theorem 4.9]. As far as we are aware, no one has performed experiments with this reduction, leading to our final open problem for graph coloring.

<sup>1</sup> http://archive.dimacs.rutgers.edu/pub/challenge/graph/benchmarks/color/.

<sup>2</sup> http://snap.stanford.edu/data.

<sup>3</sup> http://www.cc.gatech.edu/dimacs10/.

**Open Problem 19.** *How effective is the crown reduction [70, Theorem 4.9] for graph coloring in practice?*

#### **2.9 Cluster Editing**

The *cluster editing* problem is as follows: given a graph *G* = (*V*,*E*), transform it into a vertex-disjoint union of cliques by inserting and deleting a minimum number of edges, i.e., by making a minimum number of edits in the graph. The problem is also known as correlation clustering and has many applications, especially in computational biology [17]. The parameterized complexity of the cluster editing problem using the number of edits *k* as a parameter is well studied. The currently best known algorithm in theory is due to Böcker [16] and has running time O(1.62^k + n + m), where *m* is the number of edges.

There has been a wide range of methods applying fixed-parameter techniques in this area. Dehne et al. [53] presented the first practical implementation of a fixed-parameter based method for cluster editing. Their algorithm is exact, implements the kernelization routines of Gramm et al. [82], and adds ideas to bound the search space for the parameter *k* via linear programming. Gramm et al. contributed three reduction rules. For example, if two vertices *u* and *v* have more than *k* common neighbors, then the edge {*u*,*v*} has to be in the solution and is added if it is not present. Moreover, if *u* and *v* have more than *k* non-common neighbors, i.e., vertices that are neighbors of either *u* or *v* but not of both, then the edge {*u*,*v*} does not belong to the solution. Lastly, if *u* and *v* have more than *k* common and more than *k* non-common neighbors, then the given instance has no solution. Overall, their method performs best using a refined branching method with re-kernelization. Interestingly, the experimental analysis of their algorithm shows that binary search may not be the best way to implement a fixed-parameter based approach for cluster editing.
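The three rules can be sketched as a single pass over all vertex pairs, as in the following Python fragment. The adjacency-set representation, the handling of the edit budget *k*, and the omission of marking pairs as permanent or forbidden are simplifications for illustration; this is not the implementation of Dehne et al. [53].

```python
def gramm_rules(adj, k):
    """One pass of the Gramm et al. style rules for cluster editing (sketch).

    adj: dict vertex -> set of neighbors (modified in place); k: remaining edit budget.
    Returns (adj, k, feasible); forced insertions/deletions are applied and charged
    to the budget.
    """
    vertices = list(adj)
    for i, u in enumerate(vertices):
        for v in vertices[i + 1:]:
            common = adj[u] & adj[v]
            non_common = (adj[u] ^ adj[v]) - {u, v}
            if len(common) > k and len(non_common) > k:
                return adj, k, False                  # no solution with <= k edits
            if len(common) > k and v not in adj[u]:
                adj[u].add(v); adj[v].add(u)          # edge forced into the solution
                k -= 1
            elif len(non_common) > k and v in adj[u]:
                adj[u].discard(v); adj[v].discard(u)  # edge forced out of the solution
                k -= 1
            if k < 0:
                return adj, k, False
    return adj, k, True
```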

Guo [83] later gave parameter-independent data reductions based on critical cliques, obtaining a linear kernel of 4*k* vertices, which was improved by Chen and Meng [45] to 2*k*. Böcker et al. [20] introduced additional parameter-independent data reductions and find that preprocessing is possible if the number of edge modifications is significantly smaller than the number of vertices in the graph. In addition to the parameter-independent rules, they combine their technique with the parameter-dependent reductions from above and with lower and upper bounds. Böcker et al. find that they can effectively reduce graphs that satisfy *k* ≤ 25|*V*|, whereas the reductions due to Guo [83] are only effective for *k* ≤ |*V*|/2. Their experiments show that computing exact solutions for cluster editing is no longer limited to small or almost transitive graphs. Afterwards, Böcker et al. [18,19] extended their results to the weighted version of the problem, in which the weight of an edge yields the cost of deleting or inserting it, and the goal is to apply a set of edge modifications of minimum total weight. To this end, they include non-trivial extensions of the data reduction rules of the unweighted case. Additionally, they present a technique to merge vertices which drastically improves the running time of their algorithm. Recently, Bastos et al. [135] combined exact methods with local search heuristics. More precisely, the authors propose a GRASP and an ILS metaheuristic with different neighborhoods as well as a new reduction rule for the problem. They show that the data reduction rules can speed up linear programming considerably: on the instances that the solver could solve to optimality, the runtime decreased by up to 95% on some instances and by 41% on average after applying the reduction rules.

**Open Problem 20.** *Is it possible to compute small kernels in practice if the parameter k is larger than* 25|*V*|*? Are there any specific data reduction rules for that case? If an instance in practice does not reduce well, does that help to obtain bounds on the parameter k?*

Since the parameter *k* is often large compared to the number of vertices, fixed-parameter algorithms may not always be practical. There have been several attempts to use other parameters, such as the number of missing edges per cluster together with the number of edges between clusters [85], or the total number of edge modifications per vertex [3,110]. Abu-Khzam [3] used local parameters that bound the number of edge additions and/or deletions per vertex, which resulted in a number of reduction rules and made it possible to solve much larger problem instances and to apply the problem effectively in data analysis [11,12].

#### **2.10 Multiterminal Cut**

The *multiterminal cut* problem with *b* terminals is defined as follows: its input is an undirected edge-weighted graph *G* = (*V*,*E*,*w*) with edge weights w : E → ℕ_{>0}, and its goal is to divide the set of vertices into *b* blocks such that each block contains exactly one terminal and the weight sum of the edges running between the blocks is minimized. It is a fundamental combinatorial optimization problem that was first formulated by Dahlhaus et al. [50] and Cunningham [48]. It is NP-hard for all *b* ≥ 3 [50], even on planar graphs, and reduces to the minimum *s*-*t*-cut problem, which is in P, for *b* = 2. The minimum *s*-*t*-cut problem aims to find the minimum cut in which the vertices *s* and *t* are in different blocks. Most algorithms for the multiterminal cut problem use minimum *s*-*t*-cuts as a subroutine. Dahlhaus et al. [50] give a 2(1 − 1/*b*)-approximation algorithm with polynomial running time. Their approximation algorithm uses the notion of *isolating cuts*, i.e., a minimum cut separating a terminal from all other terminals. They prove that the union of the *b* − 1 smallest isolating cuts yields a valid multiterminal cut with the desired approximation ratio. The currently best known approximation algorithm, by Buchbinder et al. [31], uses linear program relaxation to achieve an approximation ratio of 1.323.
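The isolating-cut approximation can be sketched with a few maximum-flow calls. The following Python fragment uses networkx and assumes that edge weights are stored in a 'capacity' attribute; it only illustrates the idea of Dahlhaus et al. [50] and is unrelated to the engineered codes discussed below.

```python
import networkx as nx

def isolating_cut_approximation(G, terminals):
    """2(1 - 1/b)-approximation for multiterminal cut via isolating cuts (sketch).

    G: nx.Graph with 'capacity' attributes on the edges; terminals: list of b vertices.
    For each terminal t, a minimum cut separating t from all other terminals is found
    by attaching an auxiliary super-sink; the union of all but the heaviest isolating
    cut is a valid multiterminal cut.
    """
    cuts = []
    for t in terminals:
        H = G.copy()
        sink = "_super_sink"
        H.add_node(sink)
        for s in terminals:
            if s != t:
                H.add_edge(s, sink, capacity=float("inf"))
        value, (side_t, _) = nx.minimum_cut(H, t, sink, capacity="capacity")
        cut_edges = {frozenset((u, v)) for u, v in G.edges()
                     if (u in side_t) != (v in side_t)}
        cuts.append((value, cut_edges))
    cuts.sort(key=lambda c: c[0])
    solution = set().union(*(edges for _, edges in cuts[:-1]))   # drop heaviest cut
    return solution
```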

Marx [123] proves that the multiterminal cut problem is fixed-parameter tractable when parameterized by the multiterminal cut weight W(G). Chen et al. [44] give the first fixed-parameter tractable algorithm with running time 4^{W(G)} · n^{O(1)}, later improved by Xiao [167] to 2^{W(G)} · n^{O(1)} and by Cao et al. [38] to 1.84^{W(G)} · n^{O(1)}.

Recently, Henzinger et al. [88] engineered an algorithm that combines the branch-and-bound formulation of Xiao [167] with existing and new data reduction rules for the problem and presented a shared-memory parallel branch-and-reduce algorithm for the multiterminal cut problem. Experiments indicate that this is orders of magnitude faster than previous ILP formulations for the problem that have been employed by practitioners. Later, reduction rules were combined with local search algorithms for the problem [87 SPP]. The algorithm uses a wide variety of reduction rules with varying computational complexity, using vertex neighborhoods, edge connectivities, articulation points, maximum flows, and further criteria to reduce the problem size. Henzinger et al. [87 SPP] report size reductions of up to multiple orders of magnitude on some instances, which make large instances solvable in practice. Additionally, they give an inexact algorithm that aggressively prunes subproblems which are unlikely to yield an improved solution.

**Open Problem 21.** *Is there an efficient way to find semi-isolated small clusters that can be contracted (either exact or inexact contraction)?*

**Open Problem 22.** *The algorithm by Henzinger et al. [88] uses only reductions that guarantee that the optimal solution remains in the graph. Are there reductions that do not guarantee optimality but give good performance in practice?*

# **3 Recent Advances for Problems in P**

#### **3.1 Minimum Cut**

Given an undirected graph with non-negative edge weights, the *minimum cut* problem is to partition the vertices into two sets so that the sum of the edge weights between the two sets is minimized. The size of a minimum cut is often also referred to as the *edge connectivity* of a graph [91,130]. Gomory and Hu [81] observed that a (global) minimum cut can be computed with *n* − 1 minimum *s*-*t*-cut computations. For the following decades, this result by Gomory and Hu was used to find better algorithms for the global minimum cut using improved maximum flow algorithms [105]. Hao and Orlin [84] adapt the push-relabel algorithm to pass information to future flow computations. When a push-relabel iteration is finished, they implicitly merge the source and sink to form a new sink and find a new source. Vertex heights are maintained over multiple iterations of push-relabel. With these techniques, they achieve a total running time of O(mn log(n^2/m)) for a graph with *n* vertices and *m* edges, which is asymptotically equal to a single run of the push-relabel algorithm.

However, for minimum cut algorithms to be viable for applications, they must be fast on small data sets and scale to large data sets. Thus, an algorithm should have either linear or near-linear running time, or have an efficient parallelization. *All* existing exact algorithms have non-linear running time [84,91,105]; the fastest of them is the deterministic algorithm of Henzinger et al. [91] with running time O(m log^2 n (log log n)^2). Although this is arguably near-linear theoretical running time, it is not known how the algorithm performs in practice. Even the randomized algorithm of Karger and Stein [105], which finds a minimum cut only with high probability, has O(n^2 log^3 n) running time, although this was later improved by Karger [104] to O(m log^3 n) and recently improved further by Gawrychowski et al. [76] to O(m log^2 n). The algorithm of Karger and Stein can be seen as a probabilistic data reduction algorithm, as it contracts random edges to reduce the problem size and gives the correct answer with a certain probability.

Padberg and Rinaldi [138] give a set of heuristics for edge contraction. Chekuri et al. [43] give an implementation of these heuristics that can be performed in time linear in the graph size. Using these heuristics it is possible to sparsify a graph while preserving at least one minimum cut in the graph. If their algorithm does not find an edge to contract, it performs a maximum flow computation, giving the algorithm a worst-case running time of O(n^4). However, the heuristics can also be used to improve the expected running time of other algorithms by applying them on interim graphs [43].

**Open Problem 23.** *Some reductions of Padberg and Rinaldi [138] potentially check each triangle in a graph. Can pruning be used to efficiently identify which subset needs to be checked?*

Nagamochi et al. [130,131] give a minimum cut algorithm that does not use any flow computations. Instead, their algorithm uses maximum spanning forests to find a non-empty set of contractible edges. The intuition behind the algorithm is as follows: suppose you have an unweighted graph with minimum cut value exactly one. Then any spanning tree must contain at least one edge of each of the minimum cuts. Hence, after computing a spanning tree, every remaining edge can be contracted without losing the minimum cut. Nagamochi, Ono, and Ibaraki extend this idea to the case where the graph can have edges with positive weight, as well as to the case in which the minimum cut is bounded by λ̂, and show how edges are identified using one modified breadth-first search. This contraction algorithm is run until the graph is contracted into a single vertex. The algorithm has a running time of O(mn + n^2 log n). Stoer and Wagner [154] give a simpler variant of the algorithm of Nagamochi, Ono, and Ibaraki [131], which has the same asymptotic time complexity. The performance of this algorithm on real-world instances, however, is significantly worse than the performance of the algorithms of Nagamochi, Ono, and Ibaraki or Hao and Orlin, as shown in experiments conducted by Jünger et al. [101]. Both the algorithms of Hao and Orlin and of Nagamochi, Ono, and Ibaraki achieve close to linear running time on most benchmark instances [43,101].

Based on the algorithm of Nagamochi, Ono, and Ibaraki, Matula [124] gives a (2 + ε)-approximation algorithm for the minimum cut problem. The algorithm contracts more edges than the algorithm of Nagamochi, Ono, and Ibaraki to guarantee a linear time complexity while still guaranteeing a (2 + ε)-approximation factor. Inspired by random contractions, Henzinger et al. [89,150 SPP] first gave a shared-memory parallel algorithm without guarantees on the cut size. The algorithm is randomized and has running time O(n + m) when run sequentially. It repeatedly reduces the input graph size with both heuristic and exact techniques, and then solves the smallest remaining problem with exact methods. The core idea of the inexact algorithm is that edges in densely connected regions (i.e., inside a cluster of a clustering) are unlikely to be in a minimum cut. The algorithm further uses exact reduction rules from Padberg and Rinaldi [138]. For example, given a bound λ̂ on the minimum cut, one can obviously contract each edge having weight larger than λ̂ without losing optimality. Experimental results indicate that the algorithm finds optimal cuts on almost all instances. At the same time, even when run sequentially, the algorithm is significantly faster (up to a factor of 4.85) than other state-of-the-art algorithms.
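The weight-based contraction rule can be sketched with a simple union-find structure, as below. The weighted adjacency-dictionary representation and the parameter name lam (for λ̂) are assumptions made for this example.

```python
def contract_heavy_edges(adj, lam):
    """Contract every edge whose weight exceeds lam, an upper bound on the minimum
    cut value: such an edge cannot cross any minimum cut, so contraction preserves
    all minimum cuts.

    adj: dict u -> dict v -> weight (undirected, symmetric).  Returns the contracted
    graph in the same representation; parallel edges are merged by summing weights."""
    parent = {v: v for v in adj}

    def find(v):                         # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u in adj:
        for v, w in adj[u].items():
            if w > lam:
                ru, rv = find(u), find(v)
                if ru != rv:
                    parent[ru] = rv
    contracted = {find(v): {} for v in adj}
    for u in adj:
        for v, w in adj[u].items():
            ru, rv = find(u), find(v)
            if ru != rv:
                contracted[ru][rv] = contracted[ru].get(rv, 0) + w
    return contracted
```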

Later, Henzinger et al. [86,150 SPP] engineered the fastest known *exact* minimum cut algorithm for the problem. To do so, the authors incorporate the proposed inexact method, use better-suited data structures and other optimizations, as well as parallelization of exact methods. More precisely, the exact algorithm uses the *inexact* minimum cut algorithm from above [89,150 SPP] to obtain a better approximate bound λ̂ for the problem (recall that the algorithm almost always gives the correct result). As known reduction techniques depend on this bound, the better bound makes it possible to apply more reductions and to reduce the size of the graph much faster. For example, edges whose incident vertices have a connectivity of at least λ̂ can be contracted without the contraction affecting the minimum cut. The new exact algorithm outperforms the state of the art by a factor of up to 2.5 already sequentially, and when run in parallel by a factor of up to 12.9. Similar reduction rules were later used by Henzinger et al. [90 SPP,150 SPP] to find all minimum cuts in graphs.

#### **3.2 Matching**

A matching *M* of a graph *G* = (*V*,*E*) is a subset of edges such that no two elements of *M* have a common endpoint. Many applications require the computation of matchings with certain properties, like being maximal (no edge can be added to *M* without violating the matching property), having maximum cardinality, or having maximum total weight ∑_{e∈M} w(e), where *w* is a positive weight function that assigns weights to edges. Although these problems can be solved optimally in polynomial time, optimal algorithms are not fast enough for many applications involving large graphs, where near-linear time algorithms are needed. For example, the most efficient algorithms for graph partitioning rely on repeatedly contracting maximal matchings, often trying to maximize some edge rating function *w*. We refer to Holtgrewe et al. [94] for details and examples. For the *maximum cardinality matching* problem, data reduction rules were proposed by Karp and Sipser [107] already in the 1980s. The rules are able to deal with vertices of degree smaller than two. For example, it is quite easy to see that a vertex of degree zero can be removed from the graph, and if a vertex has degree one, then there is always a maximum matching that contains its incident edge.
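A minimal Python sketch of these degree-0 and degree-1 rules is given below; the adjacency-set representation is an assumption, and real implementations are considerably more careful about data structures.

```python
from collections import deque

def karp_sipser_low_degree(adj):
    """Karp-Sipser style degree-0/1 rules for maximum cardinality matching (sketch).

    adj: dict vertex -> set of neighbors.  Degree-0 vertices are dropped; for a
    degree-1 vertex its unique edge is put into the matching (some maximum matching
    always contains it).  Returns the forced partial matching and the remaining graph."""
    adj = {v: set(n) for v, n in adj.items()}
    matching = []
    queue = deque(v for v, n in adj.items() if len(n) <= 1)
    while queue:
        v = queue.popleft()
        if v not in adj:
            continue
        if not adj[v]:                   # degree 0: just remove
            del adj[v]
            continue
        u = next(iter(adj[v]))           # degree 1: match v with its only neighbor
        matching.append((v, u))
        for x in (v, u):                 # remove both endpoints and update neighbors
            for y in adj[x]:
                if y not in (v, u):
                    adj[y].discard(x)
                    if len(adj[y]) <= 1:
                        queue.append(y)
            del adj[x]
    return matching, adj
```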

Möhring and Müller-Hannemann [128] were among the first to use these rules to speed up heuristic algorithms for the general maximum cardinality matching problem. As exact algorithms for matching problems typically search for augmenting paths, they can be sped up by using a good initial matching. Hence, Langguth et al. [117] later analyzed the effects of various initializations on the total running time of several exact algorithms for the bipartite maximum cardinality matching problem and were able to achieve significant speedups.

Korenwein et al. [111] implement (near-)linear time data reduction rules for the unweighted case as well as the positive-integer-weight case. The applied reductions include the Karp-Sipser rules, as well as rules due to Mertzios et al. [127], who have also shown that the maximum cardinality matching problem admits a kernel with at most 12*k* vertices and 13*k* edges, where *k* is the feedback edge number. Moreover, Korenwein et al. [111] transfer results from vertex cover to the matching problem, e.g., crown and LP-based data reductions. Experiments indicate that using data reduction rules can speed up state-of-the-art solvers by a factor of 4.7 for the unweighted case and 12.72 on average in the weighted case.

**Open Problem 24.** *Can the reduction rules due to Korenwein et al. [111] be exhaustively applied in linear time? Are there more rules that can be transferred from vertex cover to the matching problem that can be applied in near-linear time?*

Kaya et al. [108] also use Karp-Sipser-based kernels for bipartite graph matching. In particular, the authors describe an efficient implementation as well as modifications to reduce the time complexity on worst-case instances. Their implementation is about a factor of 2 faster than the general-purpose implementation of Korenwein et al. [111]. Recently, Panagiotas and Uçar [139] engineered fast, almost optimal algorithms for bipartite graph matching. To this end, the authors investigate two randomized algorithms by Karp et al. [106] and Goel et al. [80] and convert them to efficient heuristics for bipartite graphs. In particular, the algorithm by Karp et al. [106] incorporates the Karp-Sipser rules. Both of their heuristics run in near-linear time and obtain matchings whose cardinality is more than 99% of the maximum.

**Open Problem 25.** *Is it possible to implement the degree-2 vertex Karp-Sipser rule in linear time?*

# **4 Engineering Techniques**

Engineering techniques are necessary to make data reduction algorithms scale in practice. We give a short overview of techniques that are currently used in practice. The techniques we reference here include dependency checking, reduction tracking, plateau/increasing data transformations, limiting to simple and fast reductions, reduce and peel, limited reductions, on-the-fly reductions, parallelization, targeted branching, and data-driven reductions.

*Dependency checking* allows pruning of reductions when they will provably not succeed, thereby significantly reducing the number of failed reduction attempts. To compute a kernel, algorithms typically apply their reductions r_1, ..., r_j by iterating over all reductions and trying to apply the current reduction r_i to all vertices. If r_i reduces at least one vertex, they restart with reduction r_1. When reduction r_j is executed but does not reduce any vertex, all reductions have been applied exhaustively, and a kernel is found. Trying to apply every reduction to all vertices can be expensive in later stages of the algorithm, where few reductions succeed. The algorithm may repeatedly attempt to apply the same reduction to a vertex even though the graph has not changed sufficiently to allow the reduction to succeed. Checking dependencies between reductions [93] makes it possible to avoid applying certain local reductions when they will provably not succeed, e.g., if their relevant neighborhood did not change since the reduction was last checked. Therefore, dependency checking keeps a set *D* of *viable* candidate vertices: vertices whose relevant neighborhood has changed and vertices that have never been considered for reductions. Reductions are then only applied to candidates in the set *D*. This avoids a lot of work and can speed up data reduction significantly.
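A generic sketch of this idea, with a hypothetical rule interface (each rule reports which vertices it changed), might look as follows in Python; the actual implementation of Hespe et al. [93] is more involved.

```python
def reduce_with_dependency_checking(adj, rules):
    """Generic reduction loop with dependency checking (sketch).

    adj:   dict vertex -> set of neighbors, modified in place by the rules.
    rules: list of functions; rule(adj, v) tries to apply one local reduction around v
           and returns the set of vertices whose neighborhood changed (empty set if
           the rule did not apply).  This interface is an assumption for illustration.
    """
    D = set(adj)                           # initially every vertex is a candidate
    while D:
        v = D.pop()
        if v not in adj:                   # vertex was removed by an earlier rule
            continue
        for rule in rules:
            changed = rule(adj, v)
            if changed:
                for u in changed:          # changed vertices and their neighbors
                    if u in adj:           # become candidates again
                        D.add(u)
                        D.update(adj[u])
                if v in adj:
                    D.add(v)               # v itself may admit further reductions
                break
    return adj
```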

*Reduction Tracking.* The algorithm by Hespe et al. [93] stops local reductions when they are not effectively reducing the global graph size. It is not *always* ideal to apply reductions exhaustively, for example if only a few reductions will succeed and they are costly. During later stages of a data reduction algorithm, local reductions may lead to very few graph changes. Therefore, it may be better to stop local reductions early instead of performing them exhaustively and to switch to global, more expensive reductions that may change the graph more significantly. Although the resulting graph is kernel-like, it may be possible to reduce it further. Such a graph is called a *quasi kernel*. Note, however, that this is a trade-off between the size of the reduced graph and data reduction speed.

*Plateau/Increasing Transformations.* The general scheme in data reduction is to apply reductions exhaustively until none of the available reductions can be applied anymore. Gellner et al. [77] engineer new generalized data reduction and transformation rules for the weighted independent set problem. A key feature of this work is a set of transformation rules that can *increase* the size of the input. Surprisingly, these so-called *increasing transformations* can simplify the problem and also open up the reduction space to yield even smaller irreducible graphs later throughout the algorithm. Overall, for the weighted independent set problem, this yields significant speedups and enables the authors to solve more instances to optimality than previously possible.

*Simple Reductions.* Often the smallest kernels (or, seemingly equivalently, the most varied reductions) give the best chance at finding solutions. For instance, the reductions used by Akiba and Iwata [5] for the maximum independent set problem are the *only* ones known to compute an exact solution on certain large-scale graphs, and these reductions are further successful in computing exact solutions in an evolutionary approach [114]. However, it is not always beneficial to compute the smallest kernel possible. Fast and simple reductions can compute kernels that are "small enough" for local search to quickly find high-quality, and even exact, solutions much faster than the reductions used to find the smallest kernels [42,51]. Fast and simple reductions can even be used to solve many large-scale instances exactly [155] just as quickly as the algorithm by Akiba and Iwata [5].

*Reduce and Peel.* Lamm et al. [114] showed that including reductions in a branch-and-reduce inspired evolutionary algorithm for the independent set problem enables finding exact solutions much faster than provably exact algorithms. To this end, reductions are applied exhaustively. Once a reduced graph is computed, vertices that are unlikely to be in the solution, e.g., vertices having a very large degree, are removed from the graph and hence excluded from the solution. The algorithm then proceeds recursively. Chang et al. [42] improved on this result by implementing reduction rules to reduce the lead time for kernelization for local search. They introduce "reducing-peeling" to find a large initial solution for local search. This technique can be viewed as computing one path through the search space of a branch-and-reduce algorithm: they repeatedly exclude high-degree vertices and reduce the graph until it is empty, and then take the solution found as an initial solution for local search.
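A simplified sketch of reducing-peeling for the independent set problem is given below; the function reduce_exhaustively is a hypothetical placeholder for an exact reduction routine, and the degree-based peeling is only one of several possible heuristics.

```python
def reducing_peeling(adj, reduce_exhaustively):
    """Reducing-peeling (simplified sketch, in the spirit of Chang et al. [42]).

    adj: dict vertex -> set of neighbors.  reduce_exhaustively(adj) is assumed to
    apply exact reductions in place and return the set of vertices it fixed into
    the independent set.  When no rule applies, the highest-degree vertex is
    heuristically removed (excluded from the solution)."""
    solution = set()
    adj = {v: set(n) for v, n in adj.items()}
    while adj:
        solution |= reduce_exhaustively(adj)
        if not adj:
            break
        v = max(adj, key=lambda u: len(adj[u]))    # peel a high-degree vertex
        for u in adj[v]:
            adj[u].discard(v)
        del adj[v]
    return solution                                 # initial solution for local search
```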

*Limited Reductions.* Sometimes reductions can be very expensive, for example if their running time depends on the number of edges in the neighborhood of a certain vertex. However, as mentioned above it is often not necessary to compute the smallest possible kernel in practice. Hence, a common technique in practice is to exclude such reductions, for example, if the degree of a vertex is too large. An application of this technique is due to Ost et al. [137 SPP] for the vertex ordering problem, where the simplicial vertex reduction rule is limited to vertices of degree at most 18.

*On-the-Fly Reductions.* Data reduction can be used as a preprocessing step to exact algorithms. However, reductions are also used to reduce the size of the search space of local search algorithms without losing solution quality. Dahlum et al. [51] apply a set of simple reductions on the fly for the independent set problem. For this algorithm, they use simple reductions that do not require changing the neighborhoods of vertices. Instead, vertices, e.g., simplicial vertices, are merely marked as removed. This speeds up local search significantly.

*Parallelization.* A general technique to speed up algorithms is parallelization. In data reduction, parallelization is also used to speed up preprocessing times. For example, "local" reduction rules have been parallelized by using graph partitioning techniques, i.e., each process works on a subgraph and applies reductions only within its subgraph [93]. At the same time, there are also attempts [93] to parallelize more expensive "global" reductions, e.g., reductions that need to access the whole input instance.

*Targeted Branching.* Branch-and-reduce algorithms often make use of vertex selection strategies that are carried over from existing branch-and-bound approaches. However, these selection strategies often do not take into account that removing certain vertices from the graph might result in an increase of the reduction space, which in turn might lead to smaller search trees. Gao et al. [72] thus present a dynamic vertex selection strategy that also takes into account one of their reduction rules and uses a degree-based selection as a fallback. Their experiments indicate that this strategy is able to provide better results when compared to a purely degree-based selection rule.

*Data-Driven Reductions.* Eblen et al. [60] show the benefits of using application-specific reduction rules that exploit prior knowledge of the input space. In particular, they use a reduction rule that is based on the empirical evaluation of large transcriptomic graphs and is able to drastically reduce the running time of their algorithm on similar instances. However, this comes at the drawback of a decrease in performance for random graphs.

# **5 Open Problems and Future Work**

We already discussed problem-specific open problems throughout this article. Here, we list some general open questions that apply to a range of problems touched on in this survey. For example, in a branch-and-reduce algorithm, can we branch specifically to obtain graphs that reduce better using the available portfolio of reductions? As a concrete example, as stated above, it may be helpful to end up with many independent connected components, and to achieve this one may be able to branch on a small vertex separator first. For most problems, it is currently unknown what makes an instance hard to reduce, e.g., when does which data reduction rule work well in practice and why? From a practitioner's perspective, it would be better to have a theoretical analysis of the expected kernel size, rather than the worst case, so as to obtain more realistic predictions in practice. One does not always need a single optimal solution, but rather a diverse set of high-quality solutions. Theoretical approaches for this have been proposed [13]; however, they remain untested in practice. Probabilistic reductions have not yet been tried in practice. On the other hand, most of the dynamic techniques that maintain a problem kernel have also not yet been implemented. A problem that needs careful investigation is the order in which reduction rules are applied, e.g., when is it good to apply which reduction rule first? Lastly, consider an instance of a problem on which all data reduction rules at hand have already been applied exhaustively, and assume that an optimal solution for the reduced instance is already known. Is it possible to discover new rules by applying machine learning techniques to such instances?

**Acknowledgement.** Partially supported by DFG grants MN 59/1-1, SCHU 2567/1-2 and SCHU 2567/3-1.

# **References**

	- 67. Fafianie, S., Kratsch, S.: A shortcut to (sun)flowers: kernels in logarithmic space or linear time. In: Italiano, G.F., Pighizzini, G., Sannella, D.T. (eds.) MFCS 2015. LNCS, vol. 9235, pp. 299–310. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48054-0\_25
	- 69. Fomin, F.V., Grandoni, F., Kratsch, D.: A measure & conquer approach for the analysis of exact algorithms. J. ACM **56**(5), 25:1–25:32 (2009). https://doi.org/10.1145/1552285.1552286
	- 70. Fomin, F.V., Lokshtanov, D., Saurabh, S., Zehavi, M.: Kernelization: Theory of Parameterized Preprocessing. Cambridge University Press, Cambridge (2019). https://doi.org/10.1017/9781107415157
	- 71. Fomin, F.V., Villanger, Y.: Subexponential parameterized algorithm for minimum fillin. SIAM J. Comput. **42**(6), 2197–2216 (2013). https://doi.org/10.1137/11085390X
	- 72. Gao, J., Chen, J., Yin, M., Chen, R., Wang, Y.: An exact algorithm for maximum *k*-plexes in massive graphs. In: Proceedings of IJCAI 2018, pp. 1449–1455 (2018). https://doi.org/10.24963/ijcai.2018/201
	- 73. Gao, W., Friedrich, T., Kötzing, T., Neumann, F.: Scaling up local search for minimum vertex cover in large graphs by parallel kernelization. In: Proceedings of ACAI 2017, pp. 131–143 (2017). https://doi.org/10.1007/978-3-319-63004-5\_11
	- 74. Garey, M.R., Johnson, D.S., Stockmeyer, L.: Some simplified NP-complete problems. In: Proceedings of STOC 1974, pp. 47–63 (1974). https://doi.org/10.1145/800119.803884
	- 75. Garey, M.R., Johnson, D.S.: Computers and Intractability. W. H. Freeman and Co., San Francisco, Calif. (1979). A Guide to the Theory of NP-Completeness
	- 76. Gawrychowski, P., Mozes, S., Weimann, O.: Minimum cut in O(m log^2 n) time. In: Proceedings of ICALP 2020, LIPIcs, vol. 168, pp. 57:1–57:15 (2020). https://doi.org/10.4230/LIPIcs.ICALP.2020.57
	- 77. Gellner, A., Lamm, S., Schulz, C., Strash, D., Zaválnij, B.: Boosting data reduction for the maximum weight independent set problem using increasing transformations. In: Proceedings of ALENEX 2021, pp. 128–142. https://doi.org/10.1137/1.9781611976472.10
	- 78. George, A.: Nested dissection of a regular finite element mesh. SIAM J. Numer. Anal. **10**(2), 345–363 (1973). https://doi.org/10.1137/0710032
	- 79. George, A., Liu, J.W.: The evolution of the minimum degree ordering algorithm. SIAM Rev. **31**(1), 1–19 (1989). https://doi.org/10.1137/1031001
	- 80. Goel, A., Kapralov, M., Khanna, S.: Perfect matchings in O(n log n) time in regular bipartite graphs. SIAM J. Comput. **42**(3), 1392–1404 (2013). https://doi.org/10.1137/100812513
	- 81. Gomory, R.E., Hu, T.C.: Multi-terminal network flows. J. Soc. Ind. Appl. Math. **9**(4), 551–570 (1961). https://doi.org/10.1137/0109047
	- 88. Henzinger, M., Noe, A., Schulz, C.: Shared-memory branch-and-reduce for multiterminal cuts. In: Proceedings of ALENEX 2020, pp. 42–55 (2020). https://doi.org/10.1137/1.9781611976007.4
	- 89. Henzinger, M., Noe, A., Schulz, C., Strash, D.: Practical minimum cut algorithms. ACM J. Exp. Algorithmics **23** (2018). https://doi.org/10.1145/3274662
	- 91. Henzinger, M., Rao, S., Wang, D.: Local flow partitioning for faster edge connectivity. SIAM J. Comput. **49**(1), 1–36 (2020). https://doi.org/10.1137/18M1180335
	- 92. Hespe, D., Lamm, S., Schulz, C., Strash, D.: WeGotYouCovered: the winning solver from the PACE 2019 challenge, vertex cover track. In: Proceedings of CSC 2020, pp. 1–11 (2020). https://doi.org/10.1137/1.9781611976229.1
	- 93. Hespe, D., Schulz, C., Strash, D.: Scalable kernelization for maximum independent sets. J. Exp. Algor. **24**(1), 1–22 (2019). https://doi.org/10.1145/3355502
	- 94. Holtgrewe, M., Sanders, P., Schulz, C.: Engineering a scalable high quality graph partitioner. In: Proceedings of IPDPS 2010, pp. 1–12 (2010). https://doi.org/10.1109/IPDPS.2010.5470485
	- 95. Iwata, Y., Oka, K., Yoshida, Y.: Linear-time FPT algorithms via network flow. In: Proceedings of SODA 2014, pp. 1749–1761 (2014). https://doi.org/10.1137/1.9781611973402.127
	- 96. Iwata, Y., Shigemura, T.: Separator-based pruned dynamic programming for Steiner tree. In: Proceedings of AAAI 2019, pp. 1520–1527 (2019). https://doi.org/10.1609/aaai.v33i01.33011520
	- 97. Jaffke, L., Jansen, B.M.P.: Fine-grained parameterized complexity analysis of graph coloring problems. In: Fotakis, D., Pagourtzis, A., Paschos, V.T. (eds.) CIAC 2017. LNCS, vol. 10236, pp. 345–356. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-57586-5\_29
	- 98. Jansen, B.M.P.: On sparsification for computing treewidth. Algorithmica **71**(3), 605– 635 (2014). https://doi.org/10.1007/s00453-014-9924-2
	- 99. Jansen, B.M.P., Pieterse, A.: Optimal data reduction for graph coloring using low-degree polynomials. Algorithmica **81**(10), 3865–3889 (2019). https://doi.org/10.1007/s00453-019-00578-5
	- 100. Jiang, H., Li, C., Manyà, F.: An exact algorithm for the maximum weight clique problem in large graphs. In: Proceedings of AAAI 2017, pp. 830–838 (2017)
	- 114. Lamm, S., Sanders, P., Schulz, C., Strash, D., Werneck, R.F.: Finding near-optimal independent sets at scale. J. Heurist. **23**(4), 207–229 (2017). https://doi.org/10.1007/s10732-017-9337-x
	- 116. Lange, J.H., Andres, B., Swoboda, P.: Combinatorial persistency criteria for multicut and max-cut. In: Proceedings of IEEE Conference Computer Vision Pattern Recognition, pp. 6093–6102 (2019). https://doi.org/10.1109/CVPR.2019.00625
	- 117. Langguth, J., Manne, F., Sanders, P.: Heuristic initialization for bipartite matching problems. ACM J. Exp. Algorithmics **15** (2010). https://doi.org/10.1145/1671970.1712656
	- 138. Padberg, M., Rinaldi, G.: An efficient algorithm for the minimum capacity cut problem. Math. Prog. **47**(1), 19–36 (1990). https://doi.org/10.1007/BF01580850
	- 139. Panagiotas, I., Uçar, B.: Engineering fast almost optimal algorithms for bipartite graph matching: Extended version. Research Report RR-9321, Inria Research Centre Grenoble, Rhône-Alpes (2020). https://hal.inria.fr/hal-02463717
	- 140. Pelofske, E., Hahn, G., Djidjev, H.: Solving large minimum vertex cover problems on a quantum annealer. In: Proceedings of CF 2019, pp. 76–84 (2019). https://doi.org/10. 1145/3310273.3321562
	- 141. Polzin, T.: Algorithms for the Steiner problem in networks. Ph.D. thesis, Universität des Saarlandes, Saarbrücken, Germany (2003). http://scidok.sulb.uni-saarland.de/ volltexte/2004/218/index.html
	- 142. Pothen, A.: The complexity of optimal elimination trees. Technical report, Pennsylvania State University, Department of Computer Science (1988). https://www.cs.purdue. edu/homes/apothen/Papers/shortest-etree1988.pdf
	- 143. Rehfeldt, D., Koch, T.: SCIP-Jack a solver for STP and variants with parallelization extensions: an update. In: Proceedings of OR 2017, pp. 191–196 (2017). https://doi. org/10.1007/978-3-319-89920-6\_27
	- 144. Rehfeldt, D., Koch, T., Maher, S.J.: Reduction techniques for the prize collecting Steiner tree problem and the maximum-weight connected subgraph problem. Networks **73**(2), 206–233 (2019). https://doi.org/10.1002/net.21857
	- 145. Reidl, F., Rossmanith, P., Villaamil, F.S., Sikdar, S.: A faster parameterized algorithm for treedepth. In: Esparza, J., Fraigniaud, P., Husfeldt, T., Koutsoupias, E. (eds.) ICALP 2014. LNCS, vol. 8572, pp. 931–942. Springer, Heidelberg (2014). https://doi.org/10. 1007/978-3-662-43948-7\_77
	- 146. Robertson, N., Seymour, P.: Graph minors. II. Algorithmic aspects of tree-width. J. Algor. **7**(3), 309–322 (1986). https://doi.org/10.1016/0196-6774(86)90023-4
	- 147. Rose, D.J.: Triangulated graphs and the elimination process. J. Math. Anal. Appl. **32**(3), 597–609 (1970). https://doi.org/10.1016/0022-247X(70)90282-9
	- 148. Sanders, P., Schulz, C.: KaHIP v3.00 Karlsruhe High Quality Partitioning User Guide. Technical report (2013). https://arxiv.org/abs/1311.1714
	- 149. Schäffer, A.A.: Optimal node ranking of trees in linear time. Inf. Proc. Lett. **33**(2), 91–96 (1989). https://doi.org/10.1016/0020-0190(89)90161-0
	- 151. Seidman, S.B., Foster, B.L.: A graph-theoretic generalization of the clique concept. J. Math. Sociol. **6**(1), 139–154 (1978). https://doi.org/10.1080/0022250X.1978.9989883
	- 152. Shinano, Y., Rehfeldt, D., Koch, T.: Building optimal steiner trees on supercomputers by using up to 43,000 cores. In: Rousseau, L.-M., Stergiou, K. (eds.) CPAIOR 2019. LNCS, vol. 11494, pp. 529–539. Springer, Cham (2019). https://doi.org/10.1007/978- 3-030-19212-9\_35
	- 153. Stallmann, M.F., Ho, Y., Goodrich, T.D.: Graph profiling for vertex cover: targeted reductions in a branch and reduce solver. Technical report (2020). https://arxiv.org/abs/ 2003.06639
	- 154. Stoer, M., Wagner, F.: A simple min-cut algorithm. J. ACM **44**(4), 585–591 (1997). https://doi.org/10.1145/263867.263872
	- 155. Strash, D.: On the power of simple reductions for the maximum independent set problem. In: Dinh, T.N., Thai, M.T. (eds.) Proccedings of COCOON 2016. LNCS, vol. 9797, pp. 345–356. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-42634- 1\_28

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Skeleton-Based Clustering by Quasi-Threshold Editing**

Ulrik Brandes<sup>1</sup>, Michael Hamann<sup>2</sup>(B), Luise Häuser<sup>2</sup>, and Dorothea Wagner<sup>2</sup>

<sup>1</sup> ETH Zürich, Zürich, Switzerland ubrandes@ethz.ch <sup>2</sup> Karlsruhe Institute of Technology, Karlsruhe, Germany michael@content-space.de, ufziw@student.kit.edu, dorothea.wagner@kit.edu

**Abstract.** We consider the problem of transforming a given graph into a quasi-threshold graph using a minimum number of edge additions and deletions. Building on the previously proposed heuristic Quasi-Threshold Mover (QTM), we present improvements both in terms of running time and quality. We propose a novel, linear-time algorithm that solves the inclusion-minimal variant of this problem, i.e., it finds a set of edge edits such that no subset of them also transforms the given graph into a quasi-threshold graph. In an extensive experimental evaluation, we apply these algorithms to a large set of graphs from different applications and find that they lead QTM to find solutions with fewer edits. Although the inclusion-minimal algorithm needs significantly more edits on its own, it outperforms the initialization heuristic previously proposed for QTM.

**Keywords:** Quasi-threshold graph · Trivially perfect graph · Graph editing · Graph clustering · Community detection

# **1 Introduction**

We consider the problem of clustering a graph by partitioning its nodes. Especially in the context of social networks, this problem is often referred to as community detection. The approach taken here is to view community detection as a graph modification problem. Specifically, Nastos and Gao [25] proposed to edit a given graph into a quasi-threshold graph and use its connected components to determine the clustering.

A quasi-threshold graph, also known as trivially perfect graph, is the transitive closure of a rooted forest [33], which can in turn be considered a skeleton of the graph. Figure 1 shows an example, and we provide a more detailed motivation for this particular approach in the next section.

As minimizing the number of edits is NP-hard [25], the Quasi-Threshold Mover (QTM) heuristic [4 SPP] starts from some rooted forest on the nodes of the input graph and moves nodes within and between trees to reduce the edit distance between the input graph and the transitive closure of the forest.

Several improvements to QTM are proposed in this chapter. We reduce the running time of one round of node moves to linear and show that the edits incident to a single node can be minimized using an additional path sorting step. This ultimately

**Fig. 1.** Example quasi-threshold graph. The skeleton is denoted by thick edges, its transitive closure is dashed, the root is the gray node. (Color figure online)

leads to a linear-time algorithm for inclusion-minimal sets of edits. To also find smaller solutions, we propose a randomization of local moves. From an extensive experimental evaluation on empirical graphs we conclude that our modifications yield substantial improvements over the original QTM algorithm in terms of the size of the edit set.

# **2 Preliminaries**

We consider simple undirected graphs *G* = (*V*, *E*) consisting of *n* := |*V*| nodes *V* that are connected by a set of *m* := |*E*| edges $E \subseteq \binom{V}{2}$, i.e., without self-loops or multi-edges. By *N*(*u*) we denote the set of neighbors of *u* ∈ *V* and deg(*u*) := |*N*(*u*)| its degree. Further, let *N*[*u*] := *N*(*u*) ∪ {*u*} be the closed neighborhood of *u*. The subgraph induced by a set of nodes *X* ⊂ *V* is denoted *G*[*X*]. With *K<sub>n</sub>*, *P<sub>n</sub>*, and *C<sub>n</sub>* we denote the complete graph, path, and cycle on *n* nodes, respectively. These will be important as induced subgraphs, and we write, say, *kK<sub>n</sub>* for *k* copies of *K<sub>n</sub>*.

Quasi-threshold graphs are graphs that contain neither a *P*<sub>4</sub> nor a *C*<sub>4</sub> as node-induced subgraphs [34]. This is equivalent to an inductive construction in which the base case is a single node and there are two construction operators: either a universal node (adjacent to all previous nodes) is added, or the disjoint union of two quasi-threshold graphs is formed. The inductive construction of a quasi-threshold graph immediately gives rise to its skeleton forest referred to in the introduction.
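The connection between a skeleton forest and the quasi-threshold graph it represents can be made concrete with a small sketch. The following snippet (an illustration only, not the chapter's implementation) computes the edge set of the transitive closure of a rooted forest given as a hypothetical `parent` array with `parent[r] == r` for roots.

```python
def closure_edges(parent):
    """Edge set of the transitive closure of the rooted forest `parent`."""
    edges = set()
    for u in range(len(parent)):
        v = u
        while parent[v] != v:                  # walk up to the root
            v = parent[v]
            edges.add((min(u, v), max(u, v)))  # u is adjacent to every ancestor
    return edges

# Example: a chain skeleton 0 <- 1 <- 2 plus an isolated root 3
print(sorted(closure_edges([0, 0, 1, 3])))
# [(0, 1), (0, 2), (1, 2)]
```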

#### **2.1 Motivation**

Many tasks in network analysis can be understood as first establishing an ideal, and then recovering that ideal from an empirical situation or at least determining a degree to which that ideal is met.

Take the most elementary notion, network density, as an example. The two idealized situations, polar opposites of one another, are the graphs *nK*<sub>1</sub> of isolated nodes and the cliques *K<sub>n</sub>*.<sup>1</sup> The number of edges in a graph is a straightforward measure of distance from the ideal case of isolated nodes on an absolute scale of measurement. Since the number $\binom{n}{2}$ of edges in a clique varies with the number of nodes *n*, *density* is often defined as the relative number $m / \binom{n}{2}$ of edges.

<sup>1</sup> It should not be lost on the reader that both names have social connotations [22].

A formulation of community detection using the same kind of reasoning can be developed as follows. Motivations for the vast majority of community detection methods generally state that the intention is to partition a graph into relatively dense subgraphs that are only sparsely connected to each other [17,28]. The idealized situation, with an undisputed partition into communities, is a *cluster graph*, defined as a disjoint union of cliques or, equivalently, a *P*<sub>3</sub>-free graph. Each connected component is a clique, and these cliques are isolated from each other.

How far is a given graph from a cluster graph, and where are its communities? On an absolute scale, distance to the ideal situation is measured by counting the number of edges that need to be added or deleted to complete cliques and make those cliques independent. In *cluster editing*, a cluster graph of minimum edit distance is sought, and its cliques induce a clustering of the original graph. The normalized number of edges that do not have to be edited is known as *performance* [13], and cluster editing is a special case of *correlation clustering* [1]. Numerous other clustering approaches are based on objective functions that normalize the difference between a graph and the cluster graph ideal by taking additional factors such as the number of clusters, size of clusters, degree in clusters, etc. into account.

Like cluster graphs, quasi-threshold graphs represent an idealized situation, which we can think of as intersecting communities. To see this, we take two additional steps.

We start from *split graphs*, which are defined as those graphs that have a partition *V* = *C* ⊎ *P* into a clique *G*[*C*] and an independent set *G*[*P*] or, equivalently, as the (2*K*<sub>2</sub>, *C*<sub>4</sub>, *C*<sub>5</sub>)-free graphs. They represent the ideal case of a core-periphery structure [3] and are characterized by their degrees: if *n* > *d*<sub>1</sub> ≥ ··· ≥ *d<sub>n</sub>* ≥ 0 is the degree sequence of a graph, then it is a split graph if and only if the *k*th Erdős–Gallai inequality $\sum_{i=1}^{k} d_i \leq k(k-1) + \sum_{j=k+1}^{n} \min\{k, d_j\}$ is actually an equality for the *corrected Durfee number* $h = \max\{k : d_k \geq k-1\}$. In this case, the *h* nodes of highest degree induce the clique and the others an independent set. The minimum number of edges that need to be edited to turn a graph into a split graph, known as its *splittance* [19], is half the difference between the two sides of the defining inequality at *k* = *h*, i.e., $\tfrac{1}{2}\bigl(h(h-1) + \sum_{j=h+1}^{n} \min\{h, d_j\} - \sum_{i=1}^{h} d_i\bigr)$. These edits can be chosen so that the *h* nodes of highest degree induce the clique and the remaining nodes the independent set. The computationally easy problem of split editing becomes intractable, for instance, if adapted for density [5] instead of edge numbers or for multiple cores [6].
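As a small illustration of the splittance formula above, the following sketch computes it directly from a degree sequence; the function name and list-based representation are ad hoc for this example.

```python
def splittance(degrees):
    """Minimum number of edits to a split graph, from the degree sequence."""
    d = sorted(degrees, reverse=True)
    n = len(d)
    # corrected Durfee number: largest k (1-indexed) with d_k >= k - 1
    h = max(k for k in range(1, n + 1) if d[k - 1] >= k - 1)
    lhs = sum(d[:h])
    rhs = h * (h - 1) + sum(min(h, dj) for dj in d[h:])
    return (rhs - lhs) // 2

print(splittance([2, 2, 2, 2]))  # C4 needs one edit -> 1
print(splittance([2, 1, 1]))     # P3 is already a split graph -> 0
```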

By distinguishing a core from a periphery, split graphs also distinguish nodes that are central (as members of the core) from others that are not (as members of the periphery). Every node in the periphery is adjacent only to nodes in the core, and every node in the core is adjacent to all other core nodes. Hence, the neighborhood of any periphery node *u* ∈ *P* is a subset of the closed neighborhood of any core node *v* ∈ *C*, i.e., *N*(*u*) ⊆ *N*[*v*]. This binary classification can be refined by comparing all pairs of nodes according to this neighborhood inclusion property, known as the vicinal preorder [16]. Schoch and Brandes [30] have shown retrospectively that this is actually the common ground of standard notions of centrality.

Graphs characterized by a total vicinal preorder are called *threshold* or *nested graphs* [23,24] and therefore represent the ideal structure of an undisputed ranking of nodes by centrality. Threshold editing is intractable, even if the input is a split graph [14], and for a number of reasons centrality has been defined via an abundance of indices rather than the node ranking in a closest threshold graph.

Threshold graphs are the (2*K*<sub>2</sub>, *C*<sub>4</sub>, *P*<sub>4</sub>)-free graphs and therefore a subclass of quasi-threshold graphs. They can be constructed by adding one node at a time, either as a universal or isolated node, so that they have a skeleton that is a caterpillar. Each connected component of a quasi-threshold graph, in turn, can be seen as a group of nested graphs that intersect at their cores but may branch out into different peripheries.

We have thus motivated quasi-threshold graphs as idealized structures of (disjoint groups of) intersecting communities. Quasi-threshold editing yields a partition into communities and in addition for each of them a centralized nesting structure represented by their skeleton tree.

#### **2.2 Related Work**

Quasi-threshold graphs can be recognized in linear time [8,34,4 SPP]. While the first algorithm [34] computes a skeleton forest if *G* is a quasi-threshold graph, the others [8,4 SPP] additionally compute a forbidden subgraph if *G* is not.

As mentioned, quasi-threshold editing is NP-hard [25]. Due to its characterization via a finite set of finite forbidden subgraphs, it is fixed-parameter tractable in the number of edits *k* [7]. In combination with the certifying recognition in linear time, this yields a simple $O(6^k \cdot (n+m))$ time algorithm. For the related problem of quasi-threshold deletion, where edges may be deleted but not added, improved branching rules have been proposed, reducing the running time from $O(4^k \cdot (n+m))$ to $O(2.42^k \cdot (n+m))$ [21]. Further, ordered enumeration of solutions is also possible with FPT delay [10]. A polynomial kernel of $O(k^7)$ nodes has been introduced by Drange and Pilipczuk [15], who also show that the problem cannot be solved in time $2^{o(k)} \cdot n^{O(1)}$ unless the Exponential Time Hypothesis fails.

The first editing heuristic has been proposed by Nastos and Gao [25]. With Quasi-Threshold Mover [4 SPP], the first editing heuristic with a running time close to linear has been proposed. Recently, a study on techniques for computing exact solutions has been published [18 SPP].

For the superclass of cographs, or *P*<sub>4</sub>-free graphs [9], the problem of inclusion-minimal editing has recently been considered [11]. Instead of asking for a set of edge edits of minimum cardinality, it asks for a set of edge edits such that no proper subset yields a cograph. While cograph editing is also NP-hard [20], inclusion-minimal cograph editing can be solved in linear time [11].

# **3 Quasi-Threshold Mover (QTM)**

The Quasi-Threshold Mover algorithm, short QTM, iteratively improves the skeleton forest to heuristically minimize the number of induced edits. It starts with a given skeleton forest; this may be the trivial skeleton in which every node is a root, which implies that all edges are deleted. In each round, it iterates over all nodes *u* in a random order and possibly moves *u* to a new position in the forest if this decreases the number of induced edits. For this, it considers every node *v* ∈ *V* \ {*u*} as a parent for *u*. Further, a subset of the children of the new parent *v* may be adopted, i.e., moved below *u*. In the induced quasi-threshold graph, *u* is then connected to *v* and all its ancestors as well as to all adopted children and their descendants. Every neighbor *x* of *u* in this set of nodes saves deleting the edge {*u*, *x*}, but every non-neighbor *y* of *u* implies inserting the edge {*u*, *y*}. Therefore, we select the parent *v* and the children to adopt such that the number of *u*-neighbors minus non-neighbors is maximized. Given a potential parent *v*, we always adopt children whose subtrees contain more neighbors than non-neighbors of *u*. We call those children *close children*. Using a DFS, we could determine for every node how many neighbors and non-neighbors are above/below that node, which would allow selecting the best parent and the children that are close. However, this gives a quadratic running time per round. Instead, QTM performs limited local searches starting from the neighbors of *u*. They only visit one or two non-neighbors per neighbor. The idea is that whenever a subtree contains more neighbors than non-neighbors, it will be fully visited. Thus, the algorithm is able to determine all close children. Similarly, the best parent is determined by propagating information upwards in the skeleton. As QTM uses a priority queue to manage nodes during this bottom-up search, the running time per round is $O(n + m \log \Delta)$, where Δ is the maximum degree.
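For illustration, here is a minimal, quadratic-time reference sketch of a single local move, not the near-linear bottom-up search just described. The `parent`/`children` maps, the neighbor sets `adj`, and the assumption that the node `u` has already been detached from the forest are all hypothetical conventions of this example. The gain of a position is the number of *u*-neighbors minus non-neighbors among the nodes *u* becomes connected to, so maximizing it minimizes the edits incident to *u*; a gain of 0 corresponds to isolating *u*.

```python
def best_move(u, parent, children, adj, nodes):
    """Return (gain, (new_parent, adopted_children)) or (0, None) to isolate u."""
    def subtree(v):                            # nodes of the subtree rooted at v
        out, stack = [], [v]
        while stack:
            x = stack.pop()
            out.append(x)
            stack.extend(children[x])
        return out

    best_gain, best = 0, None
    for v in nodes:                            # candidate parent
        if v == u:
            continue
        anc, x = [], v                         # v and all of its ancestors
        while True:
            anc.append(x)
            if parent[x] == x:
                break
            x = parent[x]
        gain = sum(1 if a in adj[u] else -1 for a in anc)
        adopted = []
        for c in children[v]:
            g = sum(1 if w in adj[u] else -1 for w in subtree(c))
            if g > 0:                          # close child: adopting it saves edits
                adopted.append(c)
                gain += g
        if gain > best_gain:
            best_gain, best = gain, (v, adopted)
    return best_gain, best
```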

In the following, we present several novel improvements for QTM. In Sect. 3.1, we show how to reduce the running time per round to linear in the number of nodes and edges. Further, in Sect. 3.2, we present an additional path sorting step that modifies the skeleton forest before every local move of a node *u* and yields a move that is optimal with respect to the edits incident to *u*. This local optimality directly gives us an inclusion-minimal algorithm, as we show in Sect. 3.3. The last improvement is randomization: in Sect. 3.4, we show how to select uniformly at random among all possible sets of edits incident to the moved node *u*.

#### **3.1 Linear Running Time**

To realize its bottom-up search, QTM needs to process nodes ordered by depth in the forest. While it is straightforward to use a bucket per level in the forest, this yields a running time linear in the depth of the deepest neighbor of *u*. It turns out, though, that we do not need to consider deep neighbors. More precisely, we show that we can ignore neighbors at a depth of more than 2 × deg(*u*). Consider a node *v* ∈ *N*(*u*) with depth larger than 2 × deg(*u*). If *u* is connected to *v* in the edited graph, this implies that *u* is also connected to all of the more than 2 × deg(*u*) ancestors of *v* (except possibly *u* itself). Among these, there can be at most the remaining deg(*u*) − 1 neighbors and thus at least deg(*u*) + 1 non-neighbors. Thus, this implies deg(*u*) + 1 edge insertions. Making *u* a root in the forest, i.e., deleting all edges incident to *u*, causes just deg(*u*) edits and is thus better. Therefore, we can ignore neighbors with depth larger than 2 × deg(*u*). We can thus use a bucket per depth for the remaining neighbors, which eliminates the log-factor in the running time.
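A hypothetical sketch of this bucketing step is shown below: neighbors of *u* deeper than 2 × deg(*u*) are dropped, and the rest are returned deepest-first, which is the order in which the bottom-up search processes them (all names are illustrative; a maintained `depth` array is assumed).

```python
def neighbors_by_depth(u, adj, depth):
    """Relevant neighbors of u, deepest first, ignoring those deeper than 2*deg(u)."""
    limit = 2 * len(adj[u])
    buckets = [[] for _ in range(limit + 1)]
    for v in adj[u]:
        if depth[v] <= limit:
            buckets[depth[v]].append(v)
    return [v for d in range(limit, -1, -1) for v in buckets[d]]
```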

#### **3.2 Sorting Simple Paths**

QTM minimizes edits with respect to the choice of a parent and adopted children of that parent. Here we show that an additional sorting step minimizes the edits incident to *u* in the edited graph independently of the chosen skeleton forest. For this, we consider *simple paths*, which we define as maximal paths in the skeleton forest in which each node except the lowest one has exactly one child. Every node is thus part of exactly one simple path, which may consist only of the node itself. A crucial observation is that reordering nodes in simple paths is the only way the skeleton forest can be modified without affecting the induced quasi-threshold graph.
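The decomposition into simple paths is straightforward to compute; the following sketch (with hypothetical `parent`/`children` maps as before) lists each simple path from its highest to its lowest node.

```python
def simple_paths(parent, children, nodes):
    """All simple paths of the skeleton forest, each listed top to bottom."""
    paths = []
    for v in nodes:
        p = parent[v]
        # v starts a new simple path if it is a root or its parent has != 1 child
        if p == v or len(children[p]) != 1:
            path = [v]
            while len(children[path[-1]]) == 1:   # extend while the chain is unique
                path.append(children[path[-1]][0])
            paths.append(path)
    return paths
```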

**Lemma 1.** *Let G be a graph and T a corresponding skeleton forest. It holds that N*[*u*] = *N*[*v*] *if and only if u and v are on the same simple path.*

*Proof.* If *N*[*u*] = *N*[*v*], then *u* and *v* are on the same simple path:

Assume otherwise, i.e., that *u* and *v* are not on the same simple path in *T*. Consider the path *P<sub>uv</sub>* between *u* and *v* in *T*. As it is not simple, it contains a node that is not its lowest node and that has at least one child *x* that is not on *P<sub>uv</sub>*. As *x* is not on *P<sub>uv</sub>*, either {*u*, *x*} ∈ *E* and {*v*, *x*} ∉ *E* or vice versa, depending on whether *u* is an ancestor of *v* or *v* an ancestor of *u*. This is a contradiction to *N*[*u*] = *N*[*v*], thus *u* and *v* must be on the same simple path.

If *u* and *v* are on the same simple path, *N*[*u*] = *N*[*v*]:

An edge {*u*, *v*} exists if and only if *u* and *v* are in an ancestor-descendant relationship in the skeleton *T*. Consider a node *u*. All ancestors/descendants of *u* apart from its simple path are also ancestors/descendants of all other nodes in its simple path. Further, the nodes in its simple path form a clique. Therefore, *N*[*u*] = *N*[*v*]. □

**Lemma 2.** *Let T, T′ be two different skeletons that induce the same quasi-threshold graph G. Then, for every node u, the simple paths of u in T and T′ consist of the same nodes.*

*Proof.* Assume otherwise, i.e., that the simple paths of *u* in *T* and *T′* differ. Then there is a node *x* that is on the simple path of *u* in *T* but not in *T′* (or vice versa, but assume w.l.o.g. that it is in *T*). As *x* and *u* are on the same simple path in *T*, *N*[*u*] = *N*[*x*] by Lemma 1. Lemma 1 also implies that *u* and *x* must be on the same simple path in *T′*, which is a contradiction to the existence of *x* and thus our assumption. Thus, the simple paths of *u* must consist of the same nodes in *T* and *T′*. □

**Lemma 3.** *Let T, T′ be two skeletons that induce the same quasi-threshold graph G. Then the only difference between T and T′ is the reordering of simple paths.*

*Proof.* Assume otherwise, i.e., that there were two skeletons *T*, *T′* that imply the same quasi-threshold graph *G* but differ by more than just a reordering of simple paths. A forest is uniquely determined by specifying the set of ancestors of every node. Thus there must be a node *u* such that the ancestors of *u* in *T* are different from the ancestors in *T′*. As a consequence, there is a node *v* that is an ancestor of *u* in *T* or *T′*, but not in both. Assume w.l.o.g. that *v* is an ancestor of *u* in *T*. Due to *T*, {*u*, *v*} ∈ *E*. As {*u*, *v*} ∈ *E* if and only if *v* is an ancestor of *u* or *v* is a descendant of *u*, *v* must be a descendant of *u* in *T′*. As *u* is an ancestor of *v* in *T′*, *N*[*u*] ⊇ *N*[*v*]. As *v* is an ancestor of *u* in *T*, *N*[*v*] ⊇ *N*[*u*] and thus *N*[*u*] = *N*[*v*]. Due to Lemma 1, this implies that *u* and *v* are together on a simple path in both *T* and *T′*.

By Lemma 2, the simple path of *u* must consist of the same nodes in *T* and *T′*. Therefore, we can replace the simple path in *T* by the simple path in *T′* without altering the resulting graph, and then search for a new pair *u*, *v* as described above. This reordering of the simple path does not change any other simple path. Therefore, if we apply this procedure repeatedly, it cannot find the same nodes again. Thus, this procedure terminates after at most *n* steps. As every step just reorders a simple path, the only difference between *T* and *T′* was a reordering of simple paths. □

The main idea of path sorting is, before moving a node *u*, to move all its neighbors to the top of their respective simple paths. Since removing *u* might unify simple paths and thus enable further reordering, *u* is first removed from the graph. This reordering makes it possible to choose the lowest neighbor of a simple path as parent without needing to insert edges to other non-neighbors in it. Note that the order within simple paths does not play a role when adopting a node *c* as a child, because all nodes in its path become neighbors of *u* anyway. We show that this minimizes the number of edits incident to *u* by considering an optimal set of edits and its skeleton forest and showing that our forest with reordered simple paths does not yield more edits.

**Lemma 4.** *Consider a graph G* = (*V*, *E*)*, a node u* ∈ *V, and a skeleton forest T. Applying QTM to u on T*<sup>−</sup>*, which is T with u removed and simple paths reordered such that neighbors of u are at the top of their simple paths, minimizes the number of edits incident to u.*

*Proof.* Let *Q* be the quasi-threshold graph with minimum edits incident to *u* and *T<sub>Q</sub>* a skeleton forest of *Q*. Let *T<sub>Q</sub>*<sup>−</sup> be *T<sub>Q</sub>* without *u*, with the children of *u* attached to *u*'s parent. This keeps all ancestor-descendant relationships between all nodes except *u* and thus all remaining edges. The reverse of this operation is exactly what QTM does: choosing a parent and potentially adopting some of its children. Thus, QTM can find an optimal set of edits incident to *u* in *T<sub>Q</sub>*<sup>−</sup>. Since, by Lemma 3, *T*<sup>−</sup> and *T<sub>Q</sub>*<sup>−</sup> differ only in the order of nodes on simple paths, we show that the orderings of *T*<sup>−</sup> and *T<sub>Q</sub>*<sup>−</sup> are equally good.

Consider the parent *p* and children *C* of *u* in *T<sub>Q</sub>*. If *p* is the lower end of its simple path in *T<sub>Q</sub>*<sup>−</sup>, we obtain the same ancestors by choosing the lowest node of *p*'s simple path in *T*<sup>−</sup> as parent. Similarly, for an adopted child *c* ∈ *C*, adopting the highest node of *c*'s simple path in *T*<sup>−</sup> yields the same descendants. If *p* is not the lower end of its simple path in *T<sub>Q</sub>*<sup>−</sup>, we distinguish two cases: *u* adopted *p*'s only child, or *u* is a leaf node in *T<sub>Q</sub>*. In the first case, neither the position in *p*'s simple path nor its node order matters, as any position and node order gives the same neighbors and thus the same edits. If *u* is a leaf node in *T<sub>Q</sub>* and *p* is not the lower end of its simple path, the node order matters, as *u* is only connected to *p* and *p*'s ancestors but not to the nodes below *p* on *p*'s simple path *P<sub>p</sub>* in *T<sub>Q</sub>*<sup>−</sup>. By Lemma 3, *P<sub>p</sub>* also exists in *T*<sup>−</sup>. Every non-neighbor of *u* among *p* and its ancestors in *P<sub>p</sub>* causes an edge insertion, while every neighbor of *u* below *p* causes an edge deletion. By moving all neighbors of *u* to the top of *P<sub>p</sub>* and choosing the lowest neighbor of *u* on *P<sub>p</sub>* as parent, we do not get any edits incident to nodes of *P<sub>p</sub>* and thus minimize the edits among all possible orderings of *P<sub>p</sub>*. This shows that QTM finds a parent and children to adopt on *T*<sup>−</sup> that minimize the number of edits incident to *u*. □

What remains to show is that maintaining and sorting all simple paths does not increase the asymptotic running time of QTM. Simple paths are maintained explicitly in a dynamic array; every node stores its simple path and its position in it. This allows us to swap a neighbor of the node to be moved, *u*, in constant time to the position of the first non-neighbor in its simple path; we also store this position. Moving nodes can cause simple paths to be split or joined. We store simple paths ordered from lowest to highest node. Whenever simple paths are split or merged, *u* is adjacent in the edited graph to the upper part of the path either before or after the move. In a split, we remove the upper part of the path from its end. In a merge, we add the nodes of the upper path to the lower path. Both operations are thus linear in the number of neighbors of *u* before or after the move. The running time analysis of QTM already accounts for running time linear in the number of neighbors of *u* in the edited graph both before and after the move. Thus, path sorting does not increase the asymptotic running time of QTM.

#### **3.3 Inclusion-Minimal Editing**

With the local moving routine of QTM, we can incrementally insert the nodes of a graph *G* into an initially empty graph. Due to Lemma 4, this minimizes the number of edits incident to the inserted node in each step. Overall, this yields an inclusion-minimal editing of *G*, as has also been shown, e.g., for interval graphs [26]. The basic idea is that if there were a set of superfluous edits, these edits could already have been omitted at the steps where they were introduced, violating the local minimality guaranteed by Lemma 4.

This inclusion-minimal editing algorithm can also be considered a one-pass streaming algorithm. To add a node, we need the skeleton of the already seen nodes, which can be stored in *O*(log *n*) bits per node. We only consider the edges of every node once; the only constraint is that, when we encounter a node *u* in the stream, we also need to receive all of its edges that are incident to already seen nodes.
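A toy version of this streaming view is sketched below, reusing the hypothetical `best_move` helper from the sketch in Sect. 3; the stream is assumed to deliver each node together with its edges to already seen nodes. This is an illustration of the idea only, not the chapter's implementation.

```python
def inclusion_minimal_editing(stream):
    """stream yields (u, earlier_neighbors); returns the skeleton and #edits."""
    parent, children, adj = {}, {}, {}
    total_edits = 0
    for u, earlier_neighbors in stream:
        adj[u] = set(earlier_neighbors)
        for v in earlier_neighbors:
            adj[v].add(u)
        parent[u], children[u] = u, []          # tentatively insert u as a root
        gain, move = best_move(u, parent, children, adj, list(parent))
        if move is not None:
            v, adopted = move
            parent[u] = v
            children[v].append(u)
            for c in adopted:                    # move adopted children below u
                children[v].remove(c)
                parent[c] = u
                children[u].append(c)
        total_edits += len(adj[u]) - gain        # edits incident to u at this step
    return parent, total_edits
```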

#### **3.4 Randomized Choices**

To accelerate convergence, the original QTM algorithm moves a node *u* only if this reduces the number of edits, and there is no rule for breaking ties between moves. The algorithm also never adopts children whose subtrees contain an equal number of neighbors and non-neighbors, as this only swaps edge deletions for insertions. We call such children *indifferent children*. We now propose to break ties by choosing uniformly at random from the best options for *u*, even if this does not lead to an improvement. The rationale is that on a plateau of equally good solutions only some may lead to better solutions in the next move. The same technique can also be applied to the inclusion-minimal editing, where a more diverse set of solutions can be obtained.

This poses two challenges: we need to find all options, and we may count each of them only once. In particular, different choices of a parent *p* and children *C* to adopt might actually yield the same quasi-threshold graph and thus only one of them should be considered. For instance, choosing a parent *x* without adopting any children is the same as choosing *x*'s parent *p* as parent and adopting *x*. But we also cannot disregard *p*, because adopting a second child of *p* would yield a different quasi-threshold graph.

Since, according to Lemma 2, the set of simple paths is unique, we can resolve the ambiguity by ensuring that a node *u* that is moved is inserted at the bottom of its new simple path. The lowest node of a simple path does not have exactly one child, for otherwise the path would not end there. Accordingly, we ignore positions where *u* adopts exactly one child.

Thus, if a potential parent *p* has exactly one close child and no indifferent children, we disregard it. If *p* has one close child, we must choose at least one indifferent child and thus get $2^{c_i} - 1$ possibilities to choose among the $c_i$ indifferent children. If *p* has at least two close children, we can choose an arbitrary subset of the indifferent children and get $2^{c_i}$ possibilities. If *p* has no close children and at most one indifferent child, we have only the single option of not adopting the child. If *p* has no close children and $c_i \geq 2$ indifferent children, we have $2^{c_i} - c_i$ possible choices among the indifferent children that do not lead to exactly one child.
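This case distinction can be summarized in a few lines; the sketch below mirrors the cases above, with `close` and `ci` denoting the numbers of close and indifferent children of a candidate parent (names chosen for this example only).

```python
def num_choices(close, ci):
    """Number of admissible adoption choices for a candidate parent."""
    if close == 0 and ci <= 1:
        return 1                     # only option: adopt nothing
    if close == 0:
        return 2 ** ci - ci          # any subset except those of size exactly one
    if close == 1:
        return 2 ** ci - 1 if ci >= 1 else 0   # must add at least one indifferent child
    return 2 ** ci                   # two or more close children: any subset works
```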

In our algorithm, we propagate the number of choices upwards in our bottom-up search together with the minimum number of required edits and the best parent. When processing a node that is a suitable parent that achieves the same number of edits, we choose it with a probability that is proportional to its number of choices for adopting children divided by the total number of choices aggregated so far. As the number of choices is exponential in the number of indifferent children, we store the logarithm of the number of choices to avoid overflows or dealing with huge integers. While this introduces rounding errors when adding numbers that are of different orders of magnitude, in these cases the chances of choosing one parent instead of the other are vanishingly small anyway.
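A standard way to realize this streaming tie-breaking is weighted reservoir sampling: keep a single candidate and replace it with probability weight over running total, where only log-weights are stored. The sketch below is an illustrative helper under these assumptions, not the NetworKit code.

```python
import math
import random

class WeightedReservoir:
    """Keep one candidate with probability proportional to its (log-)weight."""
    def __init__(self):
        self.log_total = float("-inf")     # log of the summed weights so far
        self.choice = None

    def offer(self, candidate, log_weight):
        self.log_total = self._log_add(self.log_total, log_weight)
        # accept with probability weight / total, evaluated in log space
        if random.random() < math.exp(log_weight - self.log_total):
            self.choice = candidate

    @staticmethod
    def _log_add(a, b):
        if a == float("-inf"):
            return b
        hi, lo = max(a, b), min(a, b)
        return hi + math.log1p(math.exp(lo - hi))
```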

QTM already guarantees that we discover nodes whose subtree contains as many neighbors as non-neighbors, so it is easy to select them. There are some cases, though, where QTM needs to be modified to propagate information about equally good parents. In particular, this is the case if the current candidate so far shows no benefit over isolating the node to move. In that case, QTM does not propagate any information, as there is always an ancestor of the current node that is at least as good. We adapt QTM to also propagate information about equally good parents even if the number of saved edits is 0. However, we do not insert the parent *p* into the priority queue unless there is an actual improvement over isolating the node to move. The reason for this is that if *p* is a non-neighbor, it causes an additional edit that leads to −1 saved edits. This cannot be compensated further up in the tree, as otherwise the path above *p* to the root would contain more *u*-neighbors than non-neighbors, and choosing the parent of *p* as the parent of *u* without adopting any children would be better.

# **4 Experimental Evaluation**

We added our extensions to the original QTM implementation in C++ as part of NetworKit [31 SPP]<sup>2</sup>. All experiments were performed on an Intel Core i7-2600K CPU with 32 GB RAM. Each algorithm was executed ten times with ten different seeds and randomly permuted node ids. By *instance*, we denote a combination of seed and (permuted) graph.

<sup>2</sup> Our implementation is available at https://github.com/michitux/networkit/tree/upstream/qtmlinear.

**Fig. 2.** Comparison of the different variants of QTM on the COG protein similarity dataset with either no initialization or the initialization heuristic. Lines ending with an "x" are algorithms that need edits for instances that are quasi-threshold graphs and are thus infinitely worse than the best algorithm.

Our algorithms are evaluated on two datasets. The first consists of 3964 connected components of the COG protein similarity data [2,27]. Each connected component consists of a symmetric matrix of similarities, and we construct an unweighted graph from its non-negative entries. Even though the dataset does not include fully connected components (i.e., cliques), 1666 components remain that are quasi-threshold graphs and do not require any edits. We restrict parts of our analysis to the 716 graphs that require at least 20 edits. As a second dataset we use 100 social networks of Facebook friendships at US universities and colleges [32].

Unless noted otherwise, QTM is run for a maximum of 400 iterations. We stop early if an iteration does not result in a node movement. With randomization enabled, however, we do continue for up to 50 iterations without improvements if nodes had more than one option.

We use so-called *performance profiles* [12] to compare the number of edits achieved by different algorithms. A performance profile indicates the fraction of instances on which an algorithm performed within a specified percentage of the best algorithm with the best seed on that graph. For readability, we sometimes divide the plots by vertical lines indicating intervals of the *x*-axis with different linear scales.
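For concreteness, a performance-profile curve can be computed along the following lines; the nested dictionary `edits[alg][instance]` of edit counts and the handling of instances whose best solution needs zero edits are assumptions of this sketch.

```python
def performance_profile(edits, alg, thresholds):
    """Fraction of instances on which `alg` is within t of the best, per threshold t."""
    instances = list(edits[alg])
    best = {i: min(edits[a][i] for a in edits) for i in instances}

    def ratio(i):
        if best[i] == 0:                         # instance needs no edits at all
            return 1.0 if edits[alg][i] == 0 else float("inf")
        return edits[alg][i] / best[i]

    rs = [ratio(i) for i in instances]
    return [sum(r <= 1 + t for r in rs) / len(rs) for t in thresholds]
```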

#### **4.1 Sorting Paths and Randomization**

We first examine the impact of sorting paths and randomization on the number of edits. In a 2×2×2-design, we combine no initialization (a spanning forest of isolated nodes) and the previous initialization heuristic [4 SPP] with iterations that make or do not make use of path sorting and randomization.

Figure 2 shows the results for the full COG protein similarity dataset. Despite the many instances that are, or are almost, quasi-threshold graphs, clear differences arise, with the old variants performing the worst. As is to be expected, quasi-threshold graphs are not always recognized without initialization. The variants with just sorting follow with some margin. Here, the difference between the two initialization algorithms is

**Fig. 3.** Comparison of the different variants of QTM on the Facebook 100 dataset with either no initialization or the initialization heuristic.

almost gone and no quasi-threshold graph requires any edits. This is not the case right after the first iteration, though, so we can rule out that one iteration of this algorithm is an alternative inclusion-minimal algorithm.

The versions with just randomization perform even better than those with just sorting paths. However, here, some graphs are not recognized as quasi-threshold graphs and a clear gap between the two initializations remains. With path sorting and randomization, the performance is even better: regardless of the initialization, 95% of the instances are as good as the best algorithm and seed, and almost all instances are within 10% of the best solution.

For the Facebook 100 dataset, the results that are shown in Fig. 3 are slightly different. First, the instances are much more challenging, with even the smallest requiring more than ten thousand edits. There are slight differences between the different solutions, which means that usually there is just one seed and algorithm that achieves the best result on a graph, explaining why no algorithm has the best solution for more than 10% of the instances. Also, we are no longer talking about 10% differences in the number of edits, but at most 2.5%. Still, there are clear differences between the algorithm variants. The original two variants need at least 0.5% more edits than the best solutions on almost all instances, while the variants with path sorting and randomization need at most 0.5% more edits than the best solutions on almost all instances. The variants with path sorting give a good improvement, but unlike in the COG protein similarity dataset, the differences between the initializations remain. With randomization, the difference between the two initializations is even larger than the difference between using just randomization and using both path sorting and randomization.

Overall, we can conclude that using path sorting and randomization significantly improves the quality of the solutions. However, on the Facebook 100 dataset, the initialization still seems to make a difference, indicating that even with these improvements we are not able to escape all local minima. Next, we consider the inclusion-minimal editing as initialization.

**Fig. 4.** Comparison of the different initializations of QTM on the COG protein similarity dataset (top) and the Facebook 100 dataset (bottom).

#### **4.2 Initialization and Convergence**

Apart from the two original initialization methods, we consider three variants of the inclusion-minimal editing that differ in the order in which nodes are inserted. We consider a random order as well as orders of descending or ascending degree. For the inclusion-minimal initialization, we also consider randomization of the chosen position in the skeleton.

First, we consider just the initialization itself in Fig. 4 for both datasets. Both plots use as "best algorithm" the algorithm runs with up to 400 iterations. For the COG protein similarity dataset, we can see that even just the initialization algorithms match some of the best solutions, which is to be expected as some instances require no edits. No initialization corresponds to just deleting all edges, and we can see that for some graphs this is very far from an optimal solution. The inclusion-minimal variants clearly need fewer edits than the initialization heuristic, with the randomized order being best and a not so clear distinction between ascending and descending order. Interestingly, the variants without additional randomization seem to perform slightly better.

On the Facebook 100 dataset, a large fraction of the edges is edited, such that even just deleting all edges is less than 50% worse than what the best algorithm achieves.

**Fig. 5.** Comparison of the different initialization variants of QTM on the COG protein similarity dataset (top) and the Facebook 100 dataset (bottom) with both path sorting and randomization after up to 400 iterations and after 20 iterations.

The initialization heuristic actually needs more edits than just deleting all edges, an observation already made by Brandes et al. [4 SPP]. The inclusion-minimal initialization algorithms perform much better than that, even though they do not match any best results. Again, the randomized order is best, followed by ascending and then descending degree order. We can also clearly see again that not randomizing the choices is slightly better. This indicates that there might be potential for further optimizing the choices in the inclusion-minimal editing algorithm.

Next, we consider how the choice of the initialization algorithm influences the result after 20 iterations or up to 400 iterations with both path sorting and randomization enabled. Figure 5 compares the results for both datasets. For the COG protein similarity dataset, the results are very close. The initialization heuristic wins both after 20 and after 400 iterations; the inclusion-minimal editing with randomized order comes second. The remaining variants follow, with the descending degree ordering being last. The differences are small, though, and in some cases the initialization seems to be more important than the number of iterations.

**Fig. 6.** Number of iterations used by QTM. The COG protein similarity dataset only includes graphs that require at least 20 edits. Whiskers extend to the 5th and 95th percentile.

For the Facebook 100 dataset, the difference between 20 and 400 iterations is clearly visible. While the initialization heuristic clearly wins, the number of iterations seems more important than the initialization. This can be explained by the much larger and more difficult graphs that also require more iterations as shown in Fig. 6. Here, we exclude the ascending and descending degree ordered initialization to improve readability.

Without randomization, most instances of the COG protein similarity dataset converge within 10 iterations. On the Facebook 100 dataset, those algorithms require up to 40 iterations for most instances to converge. Enabling path sorting decreases the number of required iterations. As the initialization is not counted as an iteration, it is natural that variants without initialization take one iteration longer in the median on the COG protein similarity dataset. The difference between the initialization heuristic and the inclusion-minimal editing as initialization is small. With randomization enabled, most instances of the Facebook 100 dataset use all 400 iterations that we allowed. With path sorting enabled, some more instances converge earlier, i.e., either no move was possible – which is unlikely here – or no improvement has been found for 50 iterations. For the COG protein similarity dataset, most instances finish in a bit more than 100 iterations. Again, this is less with path sorting.

We conclude that the initialization heuristic introduced by Brandes et al. [4 SPP] is still unmatched in final results even though it is initially worse than the new inclusion-minimal variants. For the inclusion-minimal editing, a random node order seems to perform best. Path sorting leads not only to better results of QTM, but also to faster convergence. Randomization leads to a much larger number of iterations that yield some improvements. Here, limiting the number of iterations is required to achieve reasonable running times, but even with only 20 iterations, randomization improves the results.

**Fig. 7.** Running time per edge and iteration vs. number of edges of QTM with initialization heuristic, path sorting and randomization on graphs of the two datasets requiring at least 20 edits.

#### **4.3 Running Time**

Figure 7 shows the running time per edge and iteration in microseconds for QTM with initialization heuristic, randomization, and path sorting. Although this includes the time for initialization, we normalized by the number of subsequent iterations. Since the initialization time is dominated by the iterations, which in turn are linear in the number of edges, this normalized running time should be roughly constant. For the COG protein similarity dataset, it actually decreases with increasing graph size. Given that this happens in the range where these graphs have only hundreds of edges, initialization overheads might play a role. For the Facebook 100 dataset, running times actually increase slightly with graph size. Between the smallest and the largest graph, we see an increase from around 0.4 µs to 0.6 µs. We examined CPU statistics and found increased cache misses to be a likely explanation. The percentage of cache misses increases while the number of instructions per edge and iteration is almost constant across the Facebook 100 dataset.

# **5 Conclusion**

We have extended the fast quasi-threshold editing heuristic QTM with new path sorting and randomization components. We have shown that path sorting provides both new local optimality guarantees in theory and better results in practice. Our experimental results indicate that randomization indeed helps escaping local optima, but convergence takes much longer, in particular for large graphs. Still, even with few iterations, results are improved in practice. We also modified QTM into a linear-time algorithm for inclusion-minimal edit sets, which serve well as initialization for QTM. While it reduces the number of edits compared to the previous initialization heuristic, the final results after convergence are slightly worse.

Therefore, it would be interesting to investigate further ways to escape local minima, e.g., by moving several nodes at once by some form of contraction. A recent master's thesis [29] extends QTM to the weighted quasi-threshold editing problem where every node pair has a cost and the goal is to find a set of edits with minimum total cost. It shows that with non-uniform edit costs, QTM seems to get stuck in local minima and investigates moving whole subtrees as a remedy. While moving subtrees helps, it also significantly increases the running time.

**Acknowledgement.** This work was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under grants BR 2158/11-2 and WA 654/22-2 within the Priority Programme 1736 *Algorithms for Big Data*.

# **References**

	- 5. Brandes, U., Holm, E., Karrenbauer, A.: Cliques in regular graphs and the core-periphery problem in social networks. In: Chan, T.-H.H., Li, M., Wang, L. (eds.) COCOA 2016. LNCS, vol. 10043, pp. 175–186. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-48749-6_13
	- 6. Bruckner, S., Hüffner, F., Komusiewicz, C.: A graph modification approach for finding core-periphery structures in protein interaction networks. Algorithms Mol. Biol. **10**, 16 (2015). https://doi.org/10.1186/s13015-015-0043-7
	- 7. Cai, L.: Fixed-parameter tractability of graph modification problems for hereditary properties. Inf. Process. Lett. **58**(4), 171–176 (1996). https://doi.org/10.1016/0020-0190(96)00050-6
	- 8. Chu, F.P.M.: A simple linear time certifying LBFS-based algorithm for recognizing trivially perfect graphs and their complements. Inf. Process. Lett. **107**(1), 7–12 (2008). https://doi.org/10.1016/j.ipl.2007.12.009
	- 9. Corneil, D.G., Lerchs, H., Burlingham, L.S.: Complement reducible graphs. Discret. Appl. Math. **3**(3), 163–174 (1981). https://doi.org/10.1016/0166-218X(81)90013-5
	- 10. Creignou, N., Ktari, R., Meier, A., Müller, J., Olive, F., Vollmer, H.: Parameterised enumeration for modification problems. Algorithms **12**(9), 189 (2019). https://doi.org/10.3390/a12090189
	- 11. Crespelle, C.: Linear-time minimal cograph editing (2019). https://perso.ens-lyon.fr/christophe.crespelle/publications/SUB minimal-cograph-editing.pdf
	- 12. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. **91**(2), 201–213 (2002). https://doi.org/10.1007/s101070100263
	- 13. van Dongen, S.M.: Graph Clustering by Flow Simulation. Ph.D. thesis, University of Utrecht (2000)
	- 14. Drange, P.G., Dregi, M.S., Lokshtanov, D., Sullivan, B.D.: On the threshold of intractability. In: Bansal, N., Finocchi, I. (eds.) ESA 2015. LNCS, vol. 9294, pp. 411–423. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-48350-3_35
	- 15. Drange, P.G., Pilipczuk, M.: A polynomial kernel for trivially perfect editing. Algorithmica **80**(12), 3481–3524 (2017). https://doi.org/10.1007/s00453-017-0401-6
	- 19. Hammer, P.L., Simeone, B.: The splittance of a graph. Combinatorica **1**(3), 275–284 (1981). https://doi.org/10.1007/BF02579333
	- 20. Liu, Y., Wang, J., Guo, J., Chen, J.: Complexity and parameterized algorithms for cograph editing. Theor. Comput. Sci. **461**, 45–54 (2012). https://doi.org/10.1016/j.tcs.2011.11.040
	- 21. Liu, Y., Wang, J., You, J., Chen, J., Cao, Y.: Edge deletion problems: branching facilitated by modular decomposition. Theor. Comput. Sci. **573**, 63–70 (2015). https://doi.org/10.1016/j.tcs.2015.01.049
	- 22. Luce, R.D., Perry, A.: A method of matrix analysis of group structure. Psychometrika **14**, 95–116 (1949). https://doi.org/10.1007/BF02289146
	- 23. Mahadev, N.V., Peled, U.N.: Threshold Graphs and Related Topics. Ann. Discrete Math. **56**. Elsevier (1995)
	- 24. Mariani, M.S., Ren, Z.M., Bascompte, J., Tessone, C.J.: Nestedness in complex networks: observation, emergence, and implications. Phys. Rep. **813**, 1–90 (2019). https://doi.org/10.1016/j.physrep.2019.04.001
	- 25. Nastos, J., Gao, Y.: Familial groups in social networks. Soc. Netw. **35**(3), 439–450 (2013). https://doi.org/10.1016/j.socnet.2013.05.001
	- 26. Ohtsuki, T., Mori, H., Kashiwabara, T., Fujisawa, T.: On minimal augmentation of a graph to obtain an interval graph. J. Comput. Syst. Sci. **22**(1), 60–97 (1981). https://doi.org/10.1016/0022-0000(81)90022-2
	- 27. Rahmann, S., Wittkop, T., Baumbach, J., Martin, M., Truß, A., Böcker, S.: Exact and heuristic algorithms for weighted cluster editing. In: CSB, pp. 391–401 (2007). https://doi.org/10.1142/9781860948732_0040
	- 28. Schaeffer, S.E.: Graph clustering. Comput. Sci. Rev. **1**(1), 27–64 (2007). https://doi.org/10.1016/j.cosrev.2007.05.001
	- 29. Schmitt, D.: Engineering Heuristic Quasi-Threshold Editing. Master's thesis, Karlsruhe Institute of Technology (2021). https://i11www.iti.kit.edu/ media/teaching/theses/maschmitt-21.pdf
	- 30. Schoch, D., Brandes, U.: Re-conceptualizing centrality in social networks. Eur. J. Appl. Math. **27**(6), 971–985 (2016). https://doi.org/10.1017/S0956792516000401
	- 32. Traud, A.L., Mucha, P.J., Porter, M.A.: Social structure of Facebook networks. Phys. A: Stat. Mech. Appl. **391**(16), 4165–4180 (2012). https://doi.org/10.1016/j.physa.2011.12.021
	- 33. Wolk, E.S.: A note on "The comparability graph of a tree". Proc. AMS **16**(1), 17–20 (1965). https://doi.org/10.2307/2033992
	- 34. Yan, J., Chen, J., Chang, G.J.: Quasi-threshold graphs. Discret. Appl. Math. **69**(3), 247–255 (1996). https://doi.org/10.1016/0166-218X(96)00094-7


# **The Space Complexity of Undirected Graph Exploration**

Yann Disser1(B) and Max Klimm<sup>2</sup>

<sup>1</sup> TU Darmstadt, Darmstadt, Germany disser@mathematik.tu-darmstadt.de <sup>2</sup> TU Berlin, Berlin, Germany klimm@tu-berlin.de

**Abstract.** We review the space complexity of deterministically exploring undirected graphs. We assume that vertices are indistinguishable and that edges have a locally unique color that guides the traversal of a space-constrained agent. The graph is considered to be explored once the agent has visited all vertices. We revisit results for this setting showing that Θ(log *n*) bits of memory are necessary and sufficient for an agent to explore all *n*-vertex graphs. We then illustrate that, if agents only have sublogarithmic memory, the number of (distinguishable) agents needed for collaborative exploration is Θ(log log *n*).

**Keywords:** Graph exploration · Multi-agent exploration · Space complexity · Connectivity · Log-space

# **1 Introduction**

When working with large data sets it is no longer justified to assume the entire input, or even a significant fraction of it, to be accessible at once. In particular, data may be spatially distributed along a dynamic network structure, such as the Internet or social networks. In this setting, the systematic navigation or crawling of the network becomes an integral component of any algorithmic processing of the data it holds. The theoretical framework of graph exploration is concerned precisely with the algorithmic problem of systematically traversing an initially unknown graph.

Generally, the main questions in graph exploration are regarding *feasibility*, i.e., how much computational power is necessary for systematic exploration, and regarding *efficiency*, i.e., how quickly a graph can be explored algorithmically. In the context of dealing with large data sets, the feasibility question is of particular importance. The necessary computational power can be captured theoretically by the space complexity of the exploration problem. Intuitively, the question is what portion of a graph we need to be able to memorize in order to avoid running in circles.

In this chapter, we review the most important results regarding the space complexity of undirected graph exploration. In Sect. 2, we introduce the graph exploration framework in more detail. In Sect. 3, we outline a general lower bound of Ω(log *n*) on the space complexity of graph exploration. Reingold's algorithm for undirected graph exploration is presented in Sect. 4. We then turn to collaborative graph exploration by a set of agents. In Sect. 5, we show that when all agents have sub-logarithmic memory $O(\log^{1-\varepsilon} n)$ for some ε > 0, then Ω(log log *n*) agents are needed to explore any undirected graph with *n* vertices. Finally, in Sect. 6, we provide a matching upper bound showing that a team of *O*(log log *n*) agents can deterministically explore any undirected *n*-vertex graph, even if each agent has only constant memory.

The aim of this chapter is to survey the key ideas of these results, and we only sketch proofs on a high level. Whenever possible, intuition is preferred over formal statements, and many details are omitted to increase accessibility. For a more formal treatment, we refer to the original papers. Pointers to the relevant literature are given in Sect. 7.

# **2 Exploration and Feasibility**

In the following, we consider an agent initially located at a vertex *v*<sub>0</sub> of an unknown, edge-colored, undirected graph *G* = (*V*, *E*). We assume the edge-coloring to be locally unique in the sense that no two edges incident to a common vertex may share a color. The agent's perception of *G* is limited to observing the set of colors of the edges incident to its current location. In every step, the agent may choose one of these colors and move to the other endpoint of the corresponding edge. Importantly, vertices with the same set of colors adjacent to them are indistinguishable to the agent. The objective of the agent is to explore *G*, i.e., to systematically visit all vertices of *G* in a finite number of steps. We are looking for a *deterministic* traversal algorithm that guarantees to explore every undirected graph. Regarding *randomized* traversal algorithms, it is known that a random walk of length *n*<sup>5</sup> log *n* visits all vertices of any graph with *n* vertices with high probability (Aleliunas et al. [1]). This yields a constant-space *perpetual* randomized graph exploration algorithm, i.e., an algorithm that runs forever and eventually visits all vertices. If *n* is known, combining this algorithm with a counter counting up to *n*<sup>5</sup> log *n* yields a log-space randomized graph exploration algorithm.

To illustrate the difficulty of deterministic exploration in this weak agent model, consider the exploration of a *fully regular* graph *G*, i.e., a graph where all vertices are incident to edges of exactly the same set of colors (cf. Fig. 1). Even if the agent knows that *G* is fully regular, after the first step, in which it learns the degree of the graph, its observations contain no information at all. In particular, every deterministic exploration algorithm must produce the same sequence of colors for any two fully regular graphs using the same colors. Intuitively, this is the most challenging setting for exploration. The algorithmic problem then reduces to asking for a *universal traversal sequence*, i.e., a sequence of colors that we can follow to eventually visit all vertices, irrespective of *G* and *v*<sub>0</sub>. Here and throughout, following a color sequence means performing a sequence of movement decisions according to it, and we say that a color sequence explores *G* if the agent visits all vertices when following it.

The exploration problem is feasible in the sense that a universal traversal sequence always exists for fully regular graphs. To see this, follow any path in an edge-colored graph and then return to the starting location by backtracking along the same path to get a color sequence that is a palindrome. Conversely, following a color sequence that is a palindrome guarantees to yield a closed tour, irrespective of the graph and the starting location. This means that we can obtain a universal traversal sequence by chaining

**Fig. 1.** A regular graph with two different starting locations. By following the color sequence "green, blue, red", the agent either moves on a cycle (left) or not (right), but there is no way to distinguish between these two cases as vertices are indistinguishable. (Color figure online)

together all color sequences that are palindromes in order of increasing lengths. The resulting sequence is guaranteed to follow every path from the starting location, irrespective of the graph, and thus to eventually visit all vertices.
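This chaining argument can be made concrete with a small sketch (our own hypothetical helpers, assuming the fully regular model above). We restrict attention to palindromes of the form *w* followed by its reverse, which are exactly the backtracking sequences described in the text.

```python
# A sketch of universal traversal via chained backtracking palindromes.
from itertools import product

def follow(graph, start, colors):
    """Follow a color sequence in a fully regular edge-colored graph
    (dict {vertex: {color: neighbor}}); returns visited vertices and the end vertex."""
    v, visited = start, {start}
    for c in colors:
        v = graph[v][c]
        visited.add(v)
    return visited, v

def backtracking_palindromes(colors, max_half_length):
    """Sequences of the form w + reverse(w): following such a palindrome first walks
    along a path and then backtracks it, so the agent returns to its start."""
    for half in range(1, max_half_length + 1):
        for w in product(sorted(colors), repeat=half):
            yield w + w[::-1]

def chained_sequence(colors, max_half_length):
    """Concatenating all these palindromes in order of increasing length yields a
    sequence that retraces every short path from the starting vertex."""
    seq = []
    for p in backtracking_palindromes(colors, max_half_length):
        seq.extend(p)
    return seq
```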

For non-regular graphs, a universal traversal sequence seems unattainable since not every color needs to be available at every vertex. However, the exploration of an arbitrary non-regular graph *G* = (*V*,*E*) can be reduced to the exploration of a fully regular graph *G*freg = (*V*freg,*E*freg) via the construction shown in Fig. 2. To this end, we first construct a regular graph *G*reg = (*V*reg,*E*reg) with bi-colored edges. For every vertex *v* ∈ *V* and each color *c* of its adjacent edges, we introduce a color copy (*v*,*c*) ∈ *V*reg, connect the color copies of *v* in a cycle and add the original edges between the respective color copies. The resulting graph has only three colors. The edges in the cycles are bi-colored, with one color pointing to the next color copy and one color pointing to the previous color copy. Edges between color copies of different vertices have a third color. We proceed to eliminate the bi-colored edges in *G*reg and obtain a fully regular graph *G*freg. This can be done by first adding an intermediate vertex for each bi-colored edge, and then mirroring (i.e., copying) the entire construction and connecting each vertex of degree 2 with its reflection with the third color.

As explained above, there is a universal traversal sequence for 3-regular graphs and, thus, the sequence also explores *G*freg. Given a universal traversal sequence for *G*freg, we can explore *G* with an additional memory overhead that is logarithmic in the maximum degree of the original graph and, thus, in *O*(log *n*). The idea is to perform a virtual traversal of *G*freg and only actually move in *G* whenever the virtual traversal transitions between color copies of different vertices of *G*. The memory is used to store at which color copy of its current location in *G* the agent is (virtually) located in *G*freg, as well as whether it is at a vertex or its reflection and whether it is located on the intermediate vertex of a bi-colored edge.
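The first step of this reduction, the color-copy graph *G*reg, can be written down directly in the dictionary representation used above (a sketch of our own; the color names "next", "prev", and "orig" are hypothetical stand-ins for the three colors described in the text).

```python
# A sketch of the reduction from an arbitrary edge-colored graph G to the
# degree-3 graph G_reg with bi-colored cycle edges.
def to_regular(G, color_order):
    """G is a dict {vertex: {color: neighbor}} with locally unique colors; color_order
    fixes a cyclic order on all colors. G_reg has one node (v, c) per vertex v and
    incident color c; the color copies of v form a cycle, and every original edge of
    color c becomes an "orig" edge between the two copies (u, c) and (v, c)."""
    G_reg = {}
    for v, nbrs in G.items():
        cycle = [c for c in color_order if c in nbrs]  # incident colors of v, in order
        k = len(cycle)
        for i, c in enumerate(cycle):
            G_reg[(v, c)] = {
                "next": (v, cycle[(i + 1) % k]),  # cycle edge, seen as "next" here ...
                "prev": (v, cycle[(i - 1) % k]),  # ... and as "prev" from the other side
                "orig": (nbrs[c], c),             # the original edge of color c
            }
    return G_reg
```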

While we have now established the general feasibility of the exploration problem, the constructed exploration algorithm is not very satisfactory in the sense that it enumerates an exponential number of sequences before all vertices are guaranteed to have been visited. This means that the algorithm requires an exponential number of moves and a linear memory size to keep track of its current state. Note that as long as the color sequence remains aperiodic, linear memory is needed to perform an exponential number of steps and, conversely, making use of a linear number of memory bits means visiting an exponential number of memory states and thus an exponential running time. In that sense, there is a direct correspondence between exponential time and linear memory. From now on, we focus on memory usage only. The natural question in this context becomes: Can we solve the exploration problem in sub-linear memory?

**Fig. 2.** Turning an arbitrary graph *G* into a regular graph *G*reg with bi-colored edges and further into a fully regular graph *G*freg. In the construction, we order the four colors of *G* cyclically as yellow-red-green-blue. In *G*reg, brown edges point to the next color available at the corresponding vertex in *G*, teal edges point to the previous color, and purple edges move to a color copy of another vertex. To construct *G*freg from *G*reg, an intermediate vertex is added to the center of each bi-colored edge, the graph is copied, and two corresponding intermediate vertices are connected by a purple edge. Starting with the color yellow on the left copy in *G*freg, the color sequence "teal-purple-brown-teal-brown-purple-brown-teal-purple" for *G*freg leads to the movement along a blue edge and a yellow edge as indicated in *G*. (Color figure online)

# **3 Trapping a Single Agent**

To approach the question of how much memory is necessary in general to deterministically explore a graph *G* of size *n*, we first need to realize how insufficient memory can manifest itself in terms of the inability of the agent to explore: Essentially, the only way that the agent may fail to explore *G* in finite time is by getting "trapped" in periodic behavior that forces it to move on a closed tour eternally, without having visited all vertices. With this in mind, we make the following definition.

**Definition 1.** *A* trap *for an exploration algorithm is given by an edge-colored graph G together with an initial location v*<sub>0</sub>*, such that the algorithm never visits all vertices of G when starting at v*<sub>0</sub>*.*

We fix a deterministic exploration algorithm *A* with a finite number $b \in \mathbb{N}$ of memory bits and construct a trap of some size *n* for this algorithm. The size of our trap then bounds the largest size of graphs that the algorithm can explore. Conversely, since the construction can be carried out for any deterministic algorithm, we obtain a lower bound on the number of memory bits necessary to explore graphs of size (up to) *n*.

To construct a trap *G* for *A*, first observe that *A* has at most 2<sup>*b*</sup> different memory states at its disposal. Our construction ensures that *G* is a fully regular graph of degree 3, using a fixed set of three colors *C*. As observed in the previous section, *A* is sure to yield the same sequence *S* of colors for all fully regular graphs using colors *C*, irrespective of the initial location *v*<sub>0</sub>. Since *A* has at most 2<sup>*b*</sup> different states, it must enter at least one state for the second time within the first 2<sup>*b*</sup> steps. Assume the same state is entered in steps $1 \le i < j \le 2^b$. Then the behavior of *A* and, consequently, *S* must become periodic after step *i*, i.e., $S = (c_1, \ldots, c_{i-1}) \oplus S_p^{\infty}$, where '⊕' denotes concatenation of sequences and $S_p$ is a finite subsequence of *S* of length $j - i$.
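This pigeonhole step translates directly into a small sketch (our own illustration, with a hypothetical interface: `step` maps a memory state to the successor state and the chosen color, which is well-defined because the observations in a fully regular graph never change).

```python
# A sketch of extracting the eventually periodic color sequence produced by a
# b-bit agent; the loop ends after at most 2^b iterations.
def prefix_and_period(step, initial_state):
    """Returns (prefix, period) such that the agent's color sequence S equals
    prefix followed by infinitely many repetitions of period."""
    seen, seq, state = {}, [], initial_state
    while state not in seen:
        seen[state] = len(seq)      # remember after how many colors this state occurred
        state, color = step(state)  # deterministic: observation is always the same
        seq.append(color)
    first = seen[state]             # the state just reached was already seen here
    return seq[:first], seq[first:]
```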

Consider the infinite walk $W = (v_0, v_1, v_2, \ldots)$ induced by *S* in the infinite 3-regular tree in which the set of colors of the edges incident to each vertex is *C*; cf. Fig. 3 (top). By definition, *A* is in the same memory state after steps *i* and *j*, implying that it follows the same infinite color sequence starting at $v_i$ in steps $i+1, i+2, \ldots$ as it does starting at $v_j$ in steps $j+1, j+2, \ldots$. Assume that $v_i = v_j$. Then the algorithm moves on a closed tour of length $j - i$ after step *i* while having visited at most $i + (j - i) = j \le 2^b$ different vertices. We can now take the subgraph *G* of the infinite tree induced by all edges incident to vertices in *W* as our trap. Note that this graph need not be fully regular, but we can add missing edges by mirroring *G* as before (cf. Sect. 2) and connecting corresponding vertex pairs of degree smaller than three by an edge of a color they are missing. This decreases the number of missing colors at all vertices of degree smaller than three and needs to be repeated at most once to make the graph fully regular.

In the case $v_i \neq v_j$, the algorithm may visit an infinite number of different vertices. The intuitive idea now is to "close a loop" by ensuring that the edges of color $c_{i+1} = c_{j+1}$ at both $v_i$ and $v_j$ lead to the same vertex. Of course, we cannot simply replace the edge of color $c_{j+1}$ at $v_j$ by the edge $\{v_j, v_{i+1}\}$ of the same color, since we also need to keep the edge $\{v_i, v_{i+1}\}$ of this color. However, we can achieve the same result by "folding" $v_i$ onto $v_j$, i.e., by identifying $v_i = v_j$ and identifying the predecessors of $v_i$ along *W* and their neighborhoods accordingly. More precisely, we identify each vertex

**Fig. 3.** Construction of a trap for a single agent with *b* bits of memory. Top: After at most 2<sup>*b*</sup> steps in a fully regular graph, the same memory state must repeat (purple vertices). Bottom: Closing a loop to trap the agent on a closed walk. (Color figure online)

*v* adjacent to $v_i$ with the unique vertex $v'$ adjacent to $v_j$ such that the colors of the edges $\{v_i, v\}$ and $\{v_j, v'\}$ coincide. We repeat this process for all vertices $v_{i-1}, \ldots, v_0$ along *W*; cf. Fig. 3 (bottom). Afterwards, we again take the subgraph induced by $\{v_0, \ldots, v_j\}$ together with their neighbors as our trap, making it fully regular as before.

In either case, we have constructed a trap of size $n = O(2^b)$. Since we can perform this construction for any deterministic algorithm with *b* memory bits, this implies a lower bound of Ω(log *n*) on the number of memory bits required to explore every graph of size up to $n \in \mathbb{N}$. We have shown the following.

**Theorem 1 (Fraigniaud et al.** [12]**).** *The number of memory bits needed for undirected, deterministic graph exploration is* Ω(log*n*)*.*

# **4 Reingold's Algorithm**

We will see that the lower bound shown in Sect. 3 on the memory needed to explore an undirected graph deterministically is tight, i.e., undirected graphs with *n* vertices can be explored deterministically with *O*(log *n*) memory. This algorithmic result follows from a famous result of Reingold [16] in which he established that USTCON ∈ *L*. Here, *L* is the class of problems solvable with logarithmic memory and USTCON is the problem of deciding, for a given undirected graph *G* = (*V*,*E*) and two designated vertices *s*,*t* ∈ *V*, whether *s* and *t* are connected in *G*. The algorithm devised by Reingold for his proof can be turned into a log-space exploration algorithm, which we outline in the following.

We first argue that fully regular graphs with constant degree and good vertex expansion can be explored with logarithmic memory. Suppose the graph *G* is fully regular with constant degree *d* and has the property that there is a constant ε > 0 such that for all vertex sets *S* ⊂ *V* with |*S*| ≤ *n*/2 there are at least (1 + ε)|*S*| vertices that are connected by an edge to a vertex in *S*. An upshot of this vertex expansion property is that the graph has at most logarithmic diameter. Indeed, for an arbitrary vertex *u* ∈ *V* there are more than *n*/2 vertices within a distance of $k = \frac{\log(n/2)}{\log(1+\varepsilon)} + 1$ of *u*, so that every pair of vertices has a common vertex within distance *k* and, thus, the diameter is at most 2*k* ∈ *O*(log *n*). Similar to the argument in Sect. 2, it then suffices to enumerate all returning color sequences of length 2*k*, which can be done with *O*(log *n*) space.

Regularity can be achieved with the transformation from *G* to *G*reg explained in Sect. 2. Here, we stick to *G*reg with its bi-colored edges instead of transforming *G*reg further into *G*freg, since the bi-colored edges of *G*reg do not harm our further arguments. We proceed to describe further transformations that turn *G*reg into another regular graph *G*exp with good vertex expansion. Let *G* be a fully *d*-regular graph with *n* vertices and let *H* be a *c*-regular graph with *d* vertices, where *c* and *d* are constants. Then the *replacement product* *G* ⓡ *H* is the graph where each vertex *v* in *G* is replaced by a copy of *H* that we call the *cloud* of *v*. The edges within a cloud keep the colors that they have in *H*. For each edge of *G*, we introduce an edge with a new inter-cloud color between the respective vertices in the corresponding clouds; cf. Fig. 4. The resulting graph *G* ⓡ *H* is fully regular with degree *c* + 1.

Based on the replacement product *G* ⓡ *H*, we introduce another graph product, the *zig-zag product* *G* ⓩ *H*. The zig-zag product *G* ⓩ *H* has the same set of vertices as the replacement product *G* ⓡ *H*, but only edges between clouds of different vertices. Specifically, let (*u*, *i*) be a vertex belonging to the cloud of *u*, and (*v*, *j*) be a vertex belonging to the cloud of *v*. Then the edge {(*u*, *i*), (*v*, *j*)} is contained in the zig-zag product if and only if there is a path of length three from (*u*, *i*) to (*v*, *j*) in *G* ⓡ *H* whose middle edge is an edge between different clouds. For a vertex (*u*, *i*) there are exactly *c*<sup>2</sup> such paths starting in (*u*, *i*): the first degree of freedom is to choose one of *c* colors (within the current cloud), then change the cloud with an inter-cloud edge, and then choose one of *c* colors for the second cloud. Associating each of these *c*<sup>2</sup> color combinations with a new color in *G* ⓩ *H*, we obtain that *G* ⓩ *H* is fully regular with degree *c*<sup>2</sup>. We note that this construction also works if *G* has bi-colored edges by allowing inter-cloud edges also between different copies of vertices of *H*. In any case, we may end up with a graph *G* ⓩ *H* having bi-colored edges.

Suppose that *H* is of constant size and that we have a traversal sequence for *G* ⓩ *H*. Then every edge traversal in *G* ⓩ *H* corresponds to three edge traversals in *G* ⓡ *H*. We maintain a stack of future edge traversals in *G* ⓡ *H*. Since *H* has constant degree, so has *G* ⓡ *H*, and we can store this stack of up to three colors in constant memory. In this way, we obtain a traversal sequence for *G* ⓡ *H* with constant memory overhead. From a traversal sequence for *G* ⓡ *H*, we further obtain a traversal sequence for *G* by memorizing the current copy of the vertex of *H* within the current cloud, similar to the virtual traversal of *G*freg in Sect. 2. As *H* has constant size, this requires only constant memory overhead. We conclude that a traversal sequence for *G* ⓩ *H* can be used to traverse *G* with constant memory overhead.
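Both products can be written down in the edge-colored dictionary representation used in the earlier sketches (our own illustration; it assumes a fully regular *G* whose color set coincides with the vertex set of *H*, and the color name "inter" is a hypothetical stand-in for the inter-cloud color).

```python
# A sketch of the replacement product and the zig-zag product for edge-colored
# graphs given as dicts {vertex: {color: neighbor}} with locally unique colors.
def replacement_product(G, H):
    """Replace each vertex v of G by a copy ("cloud") of H on the colors of G;
    cloud edges keep their H-colors, and each G-edge of color a becomes an
    'inter'-colored edge between the two copies (u, a) and (v, a)."""
    R = {}
    for v in G:
        for a in G[v]:
            R[(v, a)] = {hc: (v, H[a][hc]) for hc in H[a]}  # intra-cloud edges
            R[(v, a)]["inter"] = (G[v][a], a)               # inter-cloud edge
    return R

def zigzag_product(G, H):
    """An edge of the zig-zag product joins (v, a) and (w, b) iff the replacement
    product contains a path (v,a) -H-color i- (v,a') -inter- (w,a') -H-color j- (w,b);
    its new color is the pair (i, j) at one endpoint (and (j, i) at the other, so
    edges may end up bi-colored, as noted in the text)."""
    Z = {}
    for v in G:
        for a in G[v]:
            nbrs = {}
            for i in H[a]:
                a2 = H[a][i]                      # first "zig" inside the cloud of v
                w = G[v][a2]                      # cross to the cloud of a G-neighbor
                for j in H[a2]:
                    nbrs[(i, j)] = (w, H[a2][j])  # second "zag" inside w's cloud
            Z[(v, a)] = nbrs
    return Z
```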

**Fig. 4.** The replacement product *G* ⓡ *H* and the zig-zag product *G* ⓩ *H* for two graphs *G* and *H*. In the replacement product *G* ⓡ *H*, edges within a cloud keep the colors they had in *H*, here brown or teal. The edges between clouds get a new inter-cloud color, here purple. Every edge in *G* ⓩ *H* corresponds to a path of length three in *G* ⓡ *H* where the middle edge is purple, e.g., an edge colored gold in *G* ⓩ *H* corresponds to a path in *G* ⓡ *H* that is teal-purple-brown. (Color figure online)

It is left to show that we can transform *G*reg into a graph with good vertex expansion. In order to show that a *d*-regular graph has good vertex expansion, it suffices to show that the second largest eigenvalue λ of the normalized adjacency matrix is bounded from above by a constant strictly smaller than 1; cf. Tanner [21], Alon and Milman [3], and Alon [2]. For the normalized adjacency matrix $M = (m_{u,v})_{u,v \in V}$, the entry $m_{u,v}$ is defined as 1/*d* times the number of edges from *u* to *v*. For ease of notation, we call a *d*-regular graph on *n* vertices an (*n*, *d*, α)-graph if λ ≤ α. We use the following properties of the second largest eigenvalues of regular graphs:

1. Alon and Sudakov [5]: A *d*-regular, connected, non-bipartite *n*-vertex graph is an $\left(n, d, 1 - \frac{1}{dn^2}\right)$-graph.

2. Basic linear algebra: Taking the *k*-th power of a graph means introducing an edge for each *k*-edge path in the original graph. If *G* is an (*n*, *d*, λ)-graph, then its *k*-th power is an $(n, d^k, \lambda^k)$-graph.

3. Explicit expanders exist: There is a constant $c \in \mathbb{N}$ and a $(c^{16}, c, 1/2)$-graph *H*.

4. Reingold, Vadhan, and Wigderson [17]: If *G* is an (*n*, *d*, λ)-graph and *H* is a (*d*, *c*, α)-graph, then $\lambda(G \, ⓩ \, H) \le \frac{1}{2}(1-\alpha^2)\lambda + \frac{1}{2}\sqrt{(1-\alpha^2)^2\lambda^2 + 4\alpha^2}$.


Let *H* be a $(c^{16}, c, 1/2)$-graph with *c* constant as in Property 3. For an arbitrary graph *G* on *n* vertices, first construct *G*reg. Let $G_0$ be equal to *G*reg except that $c^{16} - 3$ self-loops are added to each vertex. Let $\ell = 2\log(c^{16}n^4)$. For $i = 1, \ldots, \ell$, define $G_i = (G_{i-1} \, ⓩ \, H)^8$, i.e., to obtain the next graph in the sequence, we first apply the zig-zag product with *H* and then take the 8-th power of the resulting graph. Note that this is well-defined since $G_{i-1} \, ⓩ \, H$ has degree $c^2$, so that $G_i = (G_{i-1} \, ⓩ \, H)^8$ has degree $c^{16}$, and $G_i \, ⓩ \, H$ is defined. Any traversal sequence for $G_i$ can be transformed with constant memory overhead into a traversal sequence for $G_{i-1}$, since it involves taking the zig-zag product with a graph of constant size and the 8-th power (which requires memorizing up to 7 additional steps). Thus, a traversal sequence for $G_\ell$ can be transformed into a traversal sequence for $G_0$ and, hence, into an exploration sequence for *G* with a memory overhead of $O(\ell) = O(\log n)$. It remains to show that $G_\ell$ has good vertex expansion. We claim that $\lambda(G_i) \le \max\{\lambda(G_{i-1})^2, 1/2\}$ for all $i = 1, \ldots, \ell$. To prove the claim, let $\lambda = \lambda(G_{i-1})$ and note that, by Property 4,

$$\lambda\left(G_{i-1} \, ⓩ \, H\right) \le \frac{1}{8}\left(3\lambda + \sqrt{9\lambda^2 + 16}\right) \le \frac{1}{8}\left(3\lambda + 5\right) = 1 - \frac{3}{8}\left(1 - \lambda\right) < 1 - \frac{1}{3}\left(1 - \lambda\right),$$

implying $\lambda(G_i) < \left(1 - \frac{1}{3}(1-\lambda)\right)^8$ by Property 2. If $\lambda < \frac{1}{2}$, then $\lambda(G_i) < \left(\frac{5}{6}\right)^8 < \frac{1}{2}$. Otherwise, it is straightforward to verify that the function $f(x) = \left(1 - \frac{1}{3}(1-x)\right)^4$ is convex on $[0,1]$ and that $1 \ge f(1)$ as well as $1/2 \ge f(1/2)$. We conclude that $f(x) \le x$ for all $x \in [1/2, 1]$, in particular

$$\left(1 - \frac{1}{3}\left(1 - \lambda\right)\right)^4 \le \lambda,$$

implying $\lambda(G_i) \le \lambda^2$. Finally, the graph $G_0$ is regular with degree $c^{16}$ and has at most $n^2$ nodes. By Property 1, this implies that $\lambda(G_0) \le 1 - \frac{1}{c^{16}n^4}$. With the claim above, we obtain

$$\lambda(G_{\ell}) \le \max\left\{\left(1 - \frac{1}{c^{16}n^4}\right)^{2^{\ell}}, \frac{1}{2}\right\} \le \max\left\{\left(\left(1 - \frac{1}{c^{16}n^4}\right)^{c^{16}n^4}\right)^{2}, \frac{1}{2}\right\} \le \max\left\{\left(1 - \frac{1}{e}\right)^{2}, \frac{1}{2}\right\} = \frac{1}{2}.$$

As we sketched above, the transformation from *G* to $G_\ell$ requires only logarithmic memory and can be conducted locally, i.e., a traversal sequence for $G_\ell$ can be transformed into an exploration sequence for *G* with logarithmic space overhead. Finally, we eliminate the bi-colored edges of $G_\ell$ as in Sect. 2. Since this construction increases the diameter of the graph by at most a factor of 2, it still has logarithmic diameter, so that a traversal sequence can be constructed with logarithmic space. This yields the following result.

**Theorem 2 (Reingold** [16]**).** *Undirected graphs can be deterministically explored with an agent that has O* (log*n*) *bits of memory.*

# **5 Trapping Multiple Agents**

After having established that Θ(log *n*) memory bits are necessary and sufficient for deterministic exploration with a single agent, we now investigate whether this bound can be lowered substantially by allowing additional agents. More precisely, we consider a setting with *k* ≥ 2 deterministic and distinguishable agents that individually behave as before, but move in a synchronized fashion and may exchange information while co-located at a vertex. To see that allowing collaboration makes a fundamental difference, even for *k* = 2, observe that, for example, two agents can distinguish closed tours simply by leaving one of them at the starting location; cf. Fig. 1. This additional power is also evidenced by a drastically increased difficulty of constructing traps: for a long time, the smallest known traps for *k* agents, with *s* memory states each, had a size of $O\big(s^{s^{\cdot^{\cdot^{\cdot^{s}}}}}\big)$ with *O*(*k*) levels in the exponent (Fraigniaud et al. [13], Rollik [18]), compared to the singly exponential bound of Theorem 1.

To see that a substantially different approach is needed to trap multiple agents, recall the construction in Sect. 3: the intuitive idea was to add vertices along a tree until the agent enters a memory state for the second time, at which point we close a loop. Since the number of memory states available to the agent is bounded by a constant, namely 2<sup>*b*</sup>, this yielded a trap of singly exponential size in *b*. The key difference when allowing multiple agents is that the behavior of the agents no longer depends only on their collective memory state. It now might make a difference for the behavior of the algorithm at which points the agents meet – which is exactly the reason why they can distinguish cycles, as explained above. This means that the behavior of the algorithm may depend in a non-trivial way on the positions of the agents in the graph relative to each other. As we increase the number of vertices *n*, the number of such configurations grows as *n*<sup>*k*</sup>, and we can no longer hope for configurations to ever repeat.

The key idea to overcome this is to force the agents to stay "close" to each other, which ensures that the number of configurations stays bounded and allows us to use the same general approach as before. The following informal definition generalizes the notion of a trap to multiple agents.

**Definition 2.** *A k-*barrier *B*<sub>*k*</sub> *in a graph G for an algorithm A is a subgraph of G whose removal disconnects the graph into two connected components, with the property that no agent ever traverses B*<sub>*k*</sub> *from one component to the other without at least k other agents entering B*<sub>*k*</sub> *during the traversal.*

In particular, a 1-barrier plays the role of a simultaneous trap for every individual agent. Note that agents may behave differently from one another, so we need to deal with each

**Fig. 5.** Sketch of the construction of an *i*-barrier. Boxes indicate (*i*−1)-barriers.

one using a separate construction. We have seen in Sect. 3 how to construct a trap for a single agent, and we can essentially chain traps together for the individual agents in order to obtain a 1-barrier. We will now describe how to recursively construct *i*-barriers for *i* ∈ {2,...,*k*}. Once we have constructed a *k*-barrier, we have the desired trap for the set of all *k* agents.

The idea of the recursive construction of an *i*-barrier is to use the same approach as in the trap for a single agent, but replacing every edge by an (*i*−1)-barrier; cf. Fig. 5. More precisely, we fix any set of *i* agents and assume that only these agents enter our construction. Since, on a meta-level, edges are now (*i*−1)-barriers, the agents can only traverse these "meta-edges" if they all enter the corresponding barrier, i.e., if they stay somewhat close together. Essentially, throughout the traversal, all agents are guaranteed to be located in one of the three (*i*−1)-barriers surrounding some meta-vertex. Of course, the same is true recursively within every (*i*−1)-barrier containing at most *i*−1 of the agents. By a careful recursive inspection, the total number of configurations of the agents can be bounded independently of the number of meta-vertices. This allows a similar approach as before: add meta-vertices until a configuration repeats and close a loop to obtain a trap. To obtain an *i*-barrier, we again need to chain together traps for every subset of *i* agents.

With some refinement and a thorough analysis, it can be shown that this yields a *k*-barrier, and thus a trap, of size $O\big(s^{25^k}\big)$ for *k* agents with *s* memory states each. In other words, the agents can explore graphs of size up to $n \le s^{25^k}$, i.e., $\log n \le 25^k \cdot \log s$ has to hold. Assuming that each agent has $O(\log^{1-\varepsilon} n)$ bits of memory for some ε ∈ (0, 1), i.e., just shy of the number needed to explore the graph on its own, we obtain $\log s = O(\log^{1-\varepsilon} n)$. Combining both bounds and taking logarithms yields $k = \Omega\left(\log \frac{\log n}{\log^{1-\varepsilon} n}\right) = \Omega(\log\log n)$. This means that we need at least $k = \Omega(\log\log n)$ agents to explore undirected graphs of size *n*, even if every agent has almost enough memory to explore on its own!

**Theorem 3. (Disser et al.** [10 SPP]**).** *Deterministic exploration of undirected graphs needs at least* Ω(log log *n*) *agents if we allow* $O(\log^{1-\varepsilon} n)$ *bits of memory for every agent, where* ε > 0*.*

# **6 Multi-agent Exploration**

We outline the design of a collaborative exploration algorithm that matches the lower bound of Theorem 3, i.e., we show that *O*(log log *n*) agents with sub-logarithmic memory are sufficient to explore unknown graphs of size up to *n*. Observe that *O*(log *n*) agents are trivially sufficient by Reingold's algorithm (Theorem 2), since we can let agents move together and make each one responsible for maintaining a constant number of memory bits.

We start with a single agent with a constant number $m_0 \in \mathbb{N}$ of memory bits and show how to iteratively boost its memory by using a small number of additional agents. First, consider how much progress, in terms of visiting vertices, the agent is able to accomplish on its own. For a single agent, we already know Reingold's algorithm, which needs logarithmic space. Expressed differently, executing Reingold's algorithm with $m_0$ bits of available memory guarantees that the agent either visits a number of distinct vertices of order $\Omega(2^{m_0})$ or completes the exploration.

These vertices can be visited multiple times, and, in general, there is no way of knowing the order in which the vertices appear during the traversal *T* produced by Reingold's algorithm. However, using one additional agent to mark vertices and multiple repetitions of Reingold's traversal for different positions of the additional agent, it can be shown that we can treat *T* as a simple cycle without self-intersections. Assuming that the agent has this cycle *T* of length $\Omega(2^{m_0/c})$, for some constant $c \in \mathbb{N}$, that it can navigate systematically, it can position a constant number $a \in \mathbb{N}$ of additional agents along *T*. Since agents are distinguishable, there are $|T|^a$ configurations that can be established in this way. The key idea now is to use the configuration of the agents along *T* as a form of virtual memory state, in order to boost the amount of memory available to the agent.

The number of memory bits that can be encoded in this way is $m_1 = \log(|T|^a) = a \cdot \log|T|$, which is of order $a \cdot m_0$. This means that we have boosted the initial memory capacity roughly by a factor of *a*. Having more (virtual) memory at its disposal, the agent can now recursively repeat the procedure, again boosting the memory by another factor of *a*, and so on. After log log *n* levels of recursion, the amount of virtual memory is of order $a^{\log\log n} \cdot m_0 = \Omega(\log n)$. But we already know that this is sufficient to complete the exploration, by Theorem 2.
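A toy calculation (with illustrative values $m_0 = 4$ and $a = 2$ that are not taken from the paper) shows why *O*(log log *n*) recursion levels suffice.

```python
# How many memory-boosting levels are needed until a^levels * m0 reaches the
# Theta(log n) bits that Reingold's algorithm requires (illustrative numbers).
import math

def levels_needed(n, m0=4, a=2):
    target = math.log2(n)          # Theta(log n) bits suffice by Theorem 2
    levels, memory = 0, m0
    while memory < target:
        memory *= a                # each level multiplies the virtual memory by a
        levels += 1
    return levels                  # grows like log log n

for n in (2**10, 2**20, 2**40, 2**80):
    print(n, levels_needed(n))
```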

For this approach to yield the claimed bound, it is crucial to argue that only a constant number of agents and memory bits are needed in each recursive level, not only to encode, but also to manipulate the virtual memory. In particular, in each move performed in some level of the recursion, the agents encoding the virtual memory on lower recursive levels need to be moved in the graph to stay in the same positions relative to the agent. It can be shown that this is indeed possible with a constant overhead in agents, and we obtain the following tight result.

**Theorem 4. (Disser et al.** [10 SPP]**).** *Undirected graphs can be deterministically explored with O* (loglog*n*) *agents, even if we only allow constant memory for every agent.*

# **7 Bibliographic Notes**

The first exploration algorithms were designed for mazes. A *maze* is a subgraph of the two-dimensional grid where the vertices are indistinguishable and each edge is labeled with its cardinal direction. To facilitate the exploration, the agent is sometimes equipped with a set of distinguishable pebbles that can be dropped and retrieved at nodes. After initial non-tight results (Blum and Sakoda [7], Budach [8], Shah [20]), it has been proven that an agent with finite memory needs two pebbles to explore any maze (Blum and Kozen [6]) and that one pebble does not suffice (Hoffmann [14]). Blum and Kozen [6] further showed that two agents with finite memory can also explore all mazes.

General undirected graphs are harder to explore. The lower bound of Ω(log *n*) on the memory needed by a single agent to deterministically explore all undirected *n*-vertex graphs given in Sect. 3 is due to Fraigniaud et al. [12]. Aleliunas et al. [1] showed that a random walk of length *n*<sup>5</sup> log *n* explores an undirected *n*-vertex graph with high probability. The deterministic algorithm for exploring undirected graphs explained in Sect. 4 is due to Reingold [16]. We here follow the presentation of the algorithm and the analysis of Reingold's original paper. There are also alternative proofs of this result that avoid the use of the zig-zag product; see Rozenman and Vadhan [19]. Reingold's algorithm constructs a universal exploration sequence. This concept was introduced by Koucký [15].

Regarding the exploration of a graph by a set of cooperating agents, Blum and Kozen [6] showed that three agents with finite memory cannot explore all finite undirected planar graphs. Rollik [18] strengthened this result by showing that for any number $k \in \mathbb{N}$ of agents with $s \in \mathbb{N}$ states each, there is a *trap* of size $O\big(s^{s^{\cdot^{\cdot^{\cdot^{s}}}}}\big)$ with $2k+1$ levels in the exponent, i.e., a graph that the agents cannot explore. Fraigniaud et al. [13] improved this bound to $k+1$ levels in the exponent. The non-planar trap of size $O\big(s^{25^k}\big)$ given in Sect. 5 is due to Disser et al. [10 SPP]. This result implies that if each agent has a sublogarithmic memory of $O(\log^{1-\varepsilon} n)$ with ε > 0, then Ω(log log *n*) agents are needed to explore all undirected *n*-vertex graphs. Another consequence of their construction is that a single agent with sublogarithmic memory needs Ω(log log *n*) pebbles to explore all undirected *n*-vertex graphs. The result that *O*(log log *n*) agents with constant memory can explore all undirected *n*-vertex graphs, presented in Sect. 6, is due to Disser et al. [10 SPP]. They actually showed that a single agent with constant memory and *O*(log log *n*) pebbles can explore the graph and provide a general reduction from agents to pebbles. They further proved that their algorithm runs in polynomial time. For results regarding the exploration time needed by an agent with unconstrained memory, see Dudek et al. [11] and Chalopin et al. [9].

# **References**

- 11. Dudek, G., Jenkin, M., Milios, E.E., Wilkes, D.: Robotic exploration as graph construction. IEEE Trans. Robot. Autom. **7**(6), 859–865 (1991). https://doi.org/10.1109/70.105395
- 12. Fraigniaud, P., Ilcinkas, D., Peer, G., Pelc, A., Peleg, D.: Graph exploration by a finite automaton. Theor. Comput. Sci. **345**(2–3), 331–344 (2005). https://doi.org/10.1016/j.tcs.2005.07.014
- 13. Fraigniaud, P., Ilcinkas, D., Rajsbaum, S., Tixeuil, S.: The reduced automata technique for graph exploration space lower bounds. In: Goldreich, O., Rosenberg, A.L., Selman, A.L. (eds.) Theoretical Computer Science. LNCS, vol. 3895, pp. 1–26. Springer, Heidelberg (2006). https://doi.org/10.1007/11685654_1
- 14. Hoffmann, F.: One pebble does not suffice to search plane labyrinths. In: Gécseg, F. (ed.) FCT 1981. LNCS, vol. 117, pp. 433–444. Springer, Heidelberg (1981). https://doi.org/10.1007/3-540-10854-8_47
- 15. Koucký, M.: Universal traversal sequences with backtracking. J. Comput. Syst. Sci. **65**(4), 717–726 (2002). https://doi.org/10.1016/S0022-0000(02)00023-5
- 16. Reingold, O.: Undirected connectivity in log-space. J. ACM **55**(4), 17:1–17:24 (2008). https://doi.org/10.1145/1391289.1391291
- 17. Reingold, O., Vadhan, S., Wigderson, A.: Entropy waves, the zig-zag graph product, and new constant-degree expanders. Ann. Math. **155**(1), 157–187 (2002). https://doi.org/10.2307/3062153
- 18. Rollik, H.: Automaten in planaren Graphen. Acta Informatica **13**, 287–298 (1980). https://doi.org/10.1007/BF00288647
- 19. Rozenman, E., Vadhan, S.: Derandomized squaring of graphs. In: Chekuri, C., Jansen, K., Rolim, J.D.P., Trevisan, L. (eds.) APPROX/RANDOM 2005. LNCS, vol. 3624, pp. 436–447. Springer, Heidelberg (2005). https://doi.org/10.1007/11538462_37
- 20. Shah, A.N.: Pebble automata on arrays. Comput. Graph. Image Process. **3**(3), 236–246 (1974). https://doi.org/10.1016/0146-664X(74)90017-3
- 21. Tanner, R.M.: Explicit concentrators from generalized *n*-gons. SIAM J. Alg. Disc. Meth. **5**(3), 287–293 (1984). https://doi.org/10.1137/0605030

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Algorithms for Big Data and Their Applications**

# **Scalable Cryptography**

Dennis Hofheinz1(B) and Eike Kiltz2

<sup>1</sup> ETH Zürich, Zürich, Switzerland hofheinz@inf.ethz.ch <sup>2</sup> Ruhr-Universität Bochum, Bochum, Germany eike.kiltz@rub.de

**Abstract.** In our modern digital society, cryptography is vital to protect the secrecy and integrity of transmitted and stored information. Settings like digital commerce, electronic banking, or simply private email communication already rely on encryption and signature schemes.

However, today's cryptographic schemes do not scale well, and thus are not suited for the increasingly large sets of data they are used on. For instance, the security guarantees currently known for RSA encryption—one of the most commonly used types of public-key encryption—degrade linearly in the number of users and ciphertexts. Hence, larger settings (such as cloud computing, or simply the scenario of encrypting all existing email traffic) may enable new and more efficient attacks. To maintain a reasonable level of security in larger scenarios, RSA keylengths must be chosen significantly larger, and the scheme becomes very inefficient. Besides, a switch in RSA keylengths requires an update of the whole public key infrastructure, an impossibility in truly large scenarios. Even worse, when the scenario grows beyond an initially anticipated size, we may lose all security guarantees.

This problem is the motivation for our project "Scalable Cryptography", which aims at offering a toolbox of cryptographic schemes that are suitable for huge sets of data. In this overview, we summarize the approach and the main findings of our project. We give a number of settings in which it is possible to indeed provide scalable cryptographic building blocks. For instance, we survey our work on the construction of scalable public-key encryption schemes (a central cryptographic building block that helps secure communication), but also briefly mention other settings such as "reconfigurable cryptography". We also provide first results on scalable *quantum-resistant* cryptography, i.e., scalable cryptographic schemes that remain secure even in the presence of a quantum computer.

**Keywords:** Public-key cryptography *·* Security reductions

# **1 Introduction and Motivation**

*Motivation: Public-Key Cryptography...* Public-key cryptography, introduced by Diffie and Hellman [13] in 1976, is at the heart of modern cryptography. A public-key encryption (PKE) scheme can be used to transmit messages securely by encrypting them. The main feature that distinguishes PKE schemes from earlier encryption schemes (and in particular from symmetric encryption schemes such as AES) is the existence of two separate keys: the encryption (or, public) key is used to encrypt messages, while the decryption (or, secret) key is used to decrypt ciphertexts.

Among the first suggested PKE schemes were the RSA scheme of Rivest, Shamir, and Adleman [35], and the scheme of Merkle and Hellman [31]. Later on, many more followed, e.g., [6,9,12,15,20,32]. Today, PKE schemes are crucially used to protect large-scale systems. For instance, PKE schemes secure Internet browsers [37] (including e-banking applications such as HBCI, the home banking computer interface standard), Internet auctions [10], or simply email [39]. We stress that such applications cannot be solved with more classical methods of encryption (like symmetric encryption) alone. However, symmetric encryption schemes like AES do play a role in making such applications more efficient.

It has become a standard requirement that a cryptographic scheme (and in particular a PKE scheme) should come with provable security guarantees. Indeed, the *in*security of a cryptographic scheme can have catastrophic consequences (think of an electronic voting scheme), and is usually not immediately detectable. Hence, security cannot be achieved using a trial-and-error method, and should be argued beforehand.

The dangers of a missing security proof are best demonstrated by the PKCS Internet browser encryption standard [36,37]. This de facto standard defines how browsers should encrypt their communication when accessing sensitive websites, e.g., for e-banking or e-commerce. An older version of that standard [36] used a PKE scheme *without* a security proof, and was subsequently broken by Bleichenbacher [8]. This caused massive media attention, and made expensive updates necessary. As a result, the updated standard [37] relies upon a variant of the RSA PKE scheme *with* a (heuristic) security proof.

We stress that a security proof always refers to a formal security model which covers the possible attacks in practice. Goldwasser and Micali [20] gave the first formal security notion, and proved a simple (but comparatively inefficient) PKE scheme secure in this sense. Later on, more efficient provably secure PKE schemes were devised (e.g., [9,12]), and the considered security notions were refined (e.g., [14,32,33,38]).

*... in a Big Data Scenario.* Now consider the following simple but realistic example scenario. Namely, imagine that every owner of a smartphone encrypts all of his/her Internet communication (using a state-of-the-art PKE scheme). Such an encryption already takes place for selected Internet connections, and usually for communication with email servers. However, for this example, we will assume that all communication is encrypted. This leads to a large-scale setting in which both the number of users and the number of encryptions is in the (large) millions. For simplicity, let us assume that there are $n_U = 2^{30}$ users, each performing $n_C = 2^{30}$ (i.e., about one billion) encryptions.<sup>1</sup>

We would like to derive provable security guarantees for the used encryption in this setting. This means that we would like to have a formal statement that the only way to break *any instance* of the used encryption scheme is to solve a (preferably well-

<sup>1</sup> Of course, many practical settings may actually involve fewer users or encryptions. To derive meaningful universal security guarantees, however, we are assuming what seems plausible in *some* realistic applications (like browser encryption or messaging apps).

understood) mathematical problem. Unfortunately, most existing PKE schemes do not scale well in this setting. For instance, the best known security guarantees for the PKCS encryption standard [37] degrade linearly in the number of users and ciphertexts. This means that while the scheme (implemented with current parameters and keylengths) is believed to be secure against attacks of complexity 2<sup>80</sup>, the best guarantees we can currently derive for the same scheme in a 2<sup>30</sup>-user, 2<sup>30</sup>-ciphertext setting are almost trivial. (Namely, in that setting, we can only guarantee that any attack on the scheme must have complexity at least 2<sup>20</sup>, i.e., the equivalent of less than a second of computing time on a modern desktop PC.)<sup>2</sup>
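The degradation can be illustrated with a one-line calculation (using the concrete numbers assumed in the text).

```python
# Back-of-the-envelope illustration of a security bound that degrades linearly
# in the number of PKE instances.
import math

single_instance_bits = 80                  # one instance resists roughly 2^80 work
n_users, n_ciphertexts = 2**30, 2**30
instances = n_users * n_ciphertexts        # 2^60 instances in total
guaranteed_bits = single_instance_bits - int(math.log2(instances))
print(guaranteed_bits)                     # 20: only about 2^20 work is guaranteed
```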

*Goals of the "Scalable Encryption" Project.* The central goal of the "Scalable Encryption" project is to provide security models and cryptographic schemes that do scale well to Big Data scenarios. In particular, we provide cryptographic constructions that feature a "tight security proof" (i.e., a security reduction which gives guarantees that do not degrade in the size of the application setting). In the following, we will present and highlight the main contributions of the project.

# **2 Tightly Secure Cryptography**

Our first and central concrete goal was to construct cryptographic schemes (and in particular PKE and signature schemes) with security guarantees that do not degrade in larger settings. Technically, we have aimed at constructing cryptographic schemes with a tight security reduction to a standard computational assumption. Several of our works prepared in the course of the "Scalable Cryptography" project have dealt with this topic.

At the core of all of these techniques lies the observation that some computational problems (such as computing discrete logarithms in a cyclic group) are *rerandomizable*. That means that one problem instance *I* can be re-randomized to obtain many problem instances *I*<sub>1</sub>, ..., *I*<sub>*n*</sub>. The solution of any instance *I*<sub>*i*</sub> will then also yield a solution for the original instance *I*. To show scalable security of, say, a PKE scheme, one would then start from a single instance *I*, and seek to embed many re-randomized problem instances *I*<sub>*i*</sub> in different instances of the PKE scheme. (For instance, a problem instance *I*<sub>*i*</sub> might correspond to the public key of a PKE instance, while the corresponding problem solution might correspond to the secret key.) If an adversary breaks any PKE instance, this leads to a solution for *I*<sub>*i*</sub>, which in turn yields a solution for *I*. In other words, breaking *any* PKE scheme instance (from a selection of many PKE instances) is no easier than breaking a single given problem instance *I*.
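As a toy illustration of the re-randomization idea (the group, generator, and parameters below are our own illustrative choices, not those of the surveyed schemes), consider discrete logarithms modulo a prime.

```python
# A sketch of re-randomizing a discrete-logarithm instance h = g^x (mod p).
import random

p = 2**127 - 1   # a Mersenne prime; real schemes use larger, carefully chosen groups
g = 3            # toy generator choice

def rerandomize(h, count):
    """Derive `count` fresh-looking instances h_i = h * g^{r_i} from one instance h."""
    rs = [random.randrange(1, p - 1) for _ in range(count)]
    return [pow(g, r, p) * h % p for r in rs], rs

# Solving any derived instance, i.e., finding x_i with g^{x_i} = h_i, yields the
# original solution x = x_i - r_i (modulo the order of g).
x = random.randrange(1, p - 1)
h = pow(g, x, p)
instances, rs = rerandomize(h, 5)
x_2 = x + rs[2]                       # pretend an adversary solved instance number 2
assert pow(g, x_2 - rs[2], p) == h
```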

There are a number of interesting computational problems (which are known to be cryptographically useful) with this re-randomizability property. However, the difficulty in executing the aforementioned strategy is to deal with *active* adversaries (that may, e.g., send maliciously formed ciphertexts to an honest user of the encryption scheme to see how this user reacts). Such adversaries may require a security reduction as above to also exhibit at least partial knowledge about the *secret key* of honest users. This makes

<sup>2</sup> We are also cautious when making assumptions about attacker complexity, and will typically assume liberal upper bounds. It should be noted, however, that current (publicly known) supercomputers are known to achieve almost 2<sup>60</sup> floating-point operations *per second*.

embedding a given challenge (with an *unknown* solution) into PKE instances much harder (since the embedded problem instance might also be easier to solve given that partial knowledge about the secret key).

In our work, we have found various technical ways to embed problem instances into PKE and other cryptographic schemes. Namely, in our work [5] (published at the TCC 2015 conference), we have presented a general framework for constructing PKE, signature, and key exchange schemes with tight security proofs even in the face of *adaptive* corruptions. We note that the emphasis of this work does not lie in practical schemes. We merely describe a general paradigm to achieve an additional security property (security against adaptive corruptions) in large scenarios.

Our work [28] (published at the PKC 2015 conference) presents an identity-based encryption (IBE) scheme secure in large scenarios. While there are previous IBE schemes whose security does not degrade in the number of *users*, our scheme is the first IBE scheme whose security properties do not degrade in the number of *ciphertexts*. Hence, our scheme is the first IBE scheme suitable for the (very realistic) scenario of a large number of encryptions per user. The techniques developed in this work could furthermore be used in our next work, [16] (published at the EUROCRYPT 2016 conference), to develop a tightly secure PKE scheme. Our scheme is the first PKE scheme for large scenarios that does not require a mathematical pairing. As a consequence, our scheme is based upon a very standard computational assumption (the Decisional Diffie-Hellman assumption), and is very efficient. This work received the "Best Paper" award at the EUROCRYPT 2016 conference.

Most tightly secure encryption schemes (including the ones from [28] and [16]) share the disadvantage of a large public key. The work [25] (published at the TCC 2016 conference) presents a technique to obtain tightly secure encryption and signature schemes with small public keys (and ciphertexts, resp. signatures). Indeed, we could show that the concepts introduced in [25] lead not only to tightly secure public-key encryption schemes with short public keys (published in [17] at the CRYPTO 2017 conference), but also to tightly secure *structure-preserving* signature schemes (published in [1,18] at the CRYPTO 2017 and EUROCRYPT 2018 conferences), and identity-based encryption schemes [27] (published at ASIACRYPT 2018).

At this point, it might be interesting to explain the importance of *structure-preserving* cryptographic building blocks (like our signature schemes from [1,18]). Informally, a structure-preserving building block is one in which all public operations are algebraic (in a formally defined sense). As a consequence, it is possible to efficiently conduct non-interactive zero-knowledge proofs about their execution (e.g., using the highly efficient proof system of Groth and Sahai [21]). In other words, it is possible to efficiently and publicly prove, e.g., knowledge of a signature without releasing that signature. This enables applications like anonymous credentials (i.e., secure digital identities) which rely on *not* releasing all available information publicly. Our tightly secure structure-preserving signature schemes are the first of their kind, and form highly flexible and universal components for such scalable systems.

Our work [7] (published at the PKC 2015 conference) provides a new framework for obtaining digital signatures with a tight security reduction from standard hardness assumptions. Concretely, we show that any Chameleon Hash function can be transformed into a tree-based signature scheme with tight security. Our framework explains and generalizes most of the existing schemes and provides a generic means for constructing tight signature schemes based on arbitrary assumptions, improving the standard Merkle tree transformation. Moreover, we obtain the first tightly secure signature scheme from the SIS assumption and several schemes based on Diffie-Hellman in the standard model.

Our paper [23] (also published at the PKC 2015 conference) considers security notions for public-key encryption in a slightly more realistic multi-challenge model. We show that two well-known and widely employed public-key encryption schemes—RSA Optimal Asymmetric Encryption Padding (RSA-OAEP) and Diffie-Hellman Integrated Encryption Standard (DHIES)—are secure in this model. Surprisingly, our reductions are optimal in terms of tightness in the sense that they are as tight as the ones for standard security. In the follow-up work [24] (to be published at the ASIACRYPT 2016 conference) we derive new and tight bounds for the composition of symmetric and asymmetric primitives. In particular, we consider the realistic cases where the symmetric part consists of popular modes of operation like CTR, CBC, CCM, and GCM.

We also investigate a similar generic encryption technique, the "Fujisaki-Okamoto" method, for achieving secure encryption. Namely, in [26] (published at the TCC 2017 conference), we show that variants of this method achieve tight security or security against quantum computers. Similarly, and even more generically, the work [19] (published at the PKC 2018 conference) investigates the tightness of the generic "KEM-DEM" paradigm for achieving efficient public-key encryption schemes.

In the paper [29] (published at the CRYPTO 2016 conference), we perform a concrete security treatment of digital signature schemes obtained from canonical identification schemes via the Fiat-Shamir transform. If the identification scheme is random self-reducible and satisfies the weakest possible security notion (hardness of key recoverability), then the signature scheme obtained via Fiat-Shamir is unforgeable against chosen-message attacks in the multi-user setting. Previous reductions incorporated an additional multiplicative loss of *N*, the number of users in the system. As an important application of our framework, we obtain a concrete security treatment for Schnorr signatures in the multi-user setting.

In the work [3] (published at the CRYPTO 2017 conference), we consider the "memory-tightness" of security reductions, as opposed to the "runtime-tightness" more commonly considered (in particular in most of the works from the previous subsection). Interestingly, this work finds that sometimes, security reductions have an inherent *intrinsic memory usage* (i.e., the reduction necessarily requires a significant amount of memory to perform its job), and that sometimes this memory usage grows with the size of the application setting. This yields another dimension of relations between different problems (and the security of certain cryptographic schemes), and shows that the scalability of cryptographic schemes can be a multi-faceted question.

The work [4] (published at the EUROCRYPT 2020 conference) does not consider security guarantees (as given, e.g., by a security reduction), but instead investigates how the best concrete attacks on cryptographic schemes scale to larger scenarios. As a result, this work gives lower bounds (and thus also security guarantees) by more directly considering all possible attacks in a generalized setting, the generic group model.

The results we have surveyed so far are concerned with the quality of a security reduction as a measure of scalability. This is a very important factor when deriving concrete security guarantees, but not the only one. For instance, in our work [22] (published at the TCC 2016 conference), we have formalized the notion of reconfigurable cryptographic schemes. A reconfigurable scheme allows adapting its security parameter (i.e., the quantitative level of given security guarantees) on the fly, without changing all registered user public keys (e.g., for encryption or signature schemes). Hence, reconfigurable cryptographic schemes avoid an expensive update of potentially huge public key databases.

This work also contains proof-of-concept PKE and signature schemes. In these schemes, every user has a long-term public key and secret key. The security of these long-term keys is based on very weak assumptions from the realm of secret-key cryptography: in our PKE scheme, for instance, the public key is the image of the secret key under a generic pseudo-random generator. These long-term keys are not directly used to encrypt or decrypt. Instead, they are used to derive short-term keys (e.g., for the RSA PKE scheme) of any desired bitlength that are then used for encryption or decryption.

# **3 Post-Quantum Cryptography**

The security of all currently used asymmetric (public-key) cryptography relies on the intractability of only two number-theoretic problems, namely the factoring problem and the discrete logarithm problem over elliptic curves and finite fields. This "monoculture" poses a dangerous security threat as, in the not too unlikely scenario of scalable quantum computers, Shor's algorithm will render all the asymmetric cryptosystems in current use immediately insecure: all data transmitted over encrypted channels - past and present - will immediately become public. This in particular also holds for the cryptography considered in the previous section. Leading international tech companies like Google and Microsoft are currently investing in building quantum computers. It can only be speculated whether large intelligence agencies are already in possession of a cryptologically useful quantum computer. For that reason, a number of standardization bodies (such as NIST) are currently selecting quantum-secure asymmetric cryptosystems. Promising candidates for building quantum-resistant asymmetric cryptosystems are, amongst others, based on finding solutions to certain difficult problems regarding codes and lattices. In this project we also worked on the foundations for finding truly practical and, at the same time, provably secure encryption schemes, key exchange protocols, signature schemes, and more complex protocols based on well understood and meaningful hard mathematical problems over codes and lattices.

In the context of cryptography, a lattice is a (full-rank) discrete subgroup of $\mathbb{R}^n$, commonly described by a basis. Basic lattice-based cryptosystems have already existed for almost two decades and are arguably among the most promising candidates for quantum resilience. They are simple and efficient in that their algorithms consist mostly of matrix operations, and they currently resist sub-exponential and quantum attacks. Drawing on the seminal work of Ajtai in 1996 [2], we are able to connect the average-case complexity of lattice problems (upon which the security of our schemes is based) to their complexity in the worst case. The latter property is unique among all known hardness assumptions and is one of the many reasons why people believe in its intractability. In this context, the "learning with errors" (LWE) problem emerged as a suitable abstraction for a hard problem on lattices, since it was shown that solving this problem would imply breaking a few well-studied lattice problems in the worst case, such as the approximate shortest vector problem.

In [11] (published at EuroS&P 2018) we proposed Kyber, a simple and fast encryption scheme. The design of Kyber has its roots in the seminal LWE-based encryption scheme of Regev [34]. Since Regev's original work, the practical efficiency of passively secure LWE encryption schemes has been improved by observing that the secret key can come from the same distribution as the noise, and by noticing that "LWE-like" schemes can be built using a square (rather than a rectangular) matrix as the public key. Kyber adds further efficiency improvements, such as dropping several bits from the public keys and ciphertexts to save bandwidth. At the core of its security analysis lies the security reduction of the Fujisaki-Okamoto transformation [26] already mentioned in Sect. 2, which transforms any passively secure encryption scheme into one withstanding active adversaries. The key feature here is that the security reduction is tight, i.e., it does not degrade with the number of evaluations of the hash function. This, together with Kyber's extremely fast performance, makes it very suitable for big-data scenarios. As of 2020, Kyber has been selected by NIST as one of the finalists of its Post-Quantum Cryptography Standardization process for public-key encryption.³
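To make the LWE-based design pattern concrete, the following is a minimal, insecure toy sketch of Regev-style encryption of a single bit. The parameters and error distribution are illustrative assumptions only; Kyber itself works over module lattices with polynomial arithmetic, compression, and the transformation mentioned above, none of which is modelled here.

```python
# Toy Regev-style LWE encryption of one bit (illustrative, NOT secure).
import numpy as np

n, m, q = 16, 32, 7681          # secret dimension, number of samples, modulus (toy sizes)
rng = np.random.default_rng(0)

def keygen():
    A = rng.integers(0, q, size=(m, n))
    s = rng.integers(0, q, size=n)               # secret key
    e = rng.integers(-2, 3, size=m)              # small error
    b = (A @ s + e) % q                          # public key: (A, b = A*s + e)
    return (A, b), s

def encrypt(pk, bit):
    A, b = pk
    r = rng.integers(0, 2, size=m)               # random 0/1 selection of LWE samples
    u = (r @ A) % q
    v = (r @ b + bit * (q // 2)) % q             # encode the bit in the high-order part
    return u, v

def decrypt(sk, ct):
    u, v = ct
    d = (v - u @ sk) % q                         # = r*e + bit*(q//2)  (mod q)
    return int(q // 4 < d < 3 * q // 4)          # round to the nearer of {0, q/2}

pk, sk = keygen()
assert all(decrypt(sk, encrypt(pk, b)) == b for b in (0, 1))
```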

# **4 Open Questions**

Although the project significantly advanced our understanding of scalable security (and in particular of scalable security *guarantees*), many questions remain. First, we are still missing technical tools to tackle the tight security of all cryptographic building blocks: the tight security (and thus the scalability) of *hierarchically organized* schemes (such as HIBE or hierarchical signature schemes) is not well understood, and most known results (such as [30]) are negative. Besides, there are few results about the scalability of new and modern cryptographic building blocks such as obfuscation, functional encryption, or homomorphic encryption. Even though these building blocks are extremely powerful (and imply a multitude of other building blocks and tasks), their scalability is currently unclear.

Moreover, the interplay between cryptanalytic attacks and the guarantees given by security reductions is generally not well-understood. The work of [4] is a promising step in this direction, but there remains a lot to be done.

# **References**


³ https://csrc.nist.gov/projects/post-quantum-cryptography.



# Distributed Data Streams

Jannik Castenow, Björn Feldkord, Jonas Hanselle, Till Knollmann, Manuel Malatyali, and Friedhelm Meyer auf der Heide(B)

Paderborn University, Paderborn, Germany
{jannik.castenow,bjoern.feldkord,jonas.hanselle,manuel.malatyali,fmadh}@upb.de, tillk@mail.upb.de

Abstract. We consider a scenario where a server is wirelessly connected to a vast number of sensor nodes that continuously measure data. The task of the server is to monitor the sensors' data. More precisely, at each time step the server computes a function defined over the current measurements of the sensors. Since the sensors have only little computational power and tight battery constraints, the communication between the server and the sensors should be as small as possible, i.e., we aim at minimizing the total number of messages transferred.

There are various conceivable problems for the setting above. We demonstrate our approaches on the following three: In the Top-*k*-Value Monitoring Problem, the server aims at identifying the *k* largest values. The Top-*k*-Position Monitoring Problem shifts the task to identifying the sensors observing these values. Finally, the Count Distinct Monitoring Problem obliges the server to determine the number of distinct values currently observed.

For all three problems, we not only present communication-efficient protocols for a single time step, but also show how similarity of the sensors' inputs between consecutive time steps can be exploited to reduce the total communication in the long term. Thereby, we utilize different techniques – involving sampling, dynamic data structures, filter-based approaches, and combinations thereof – to demonstrate their potential and their limits in the broad setting described above.

Keywords: Top-k · Count distinct · Distributed monitoring · Distributed data streams

# 1 Introduction

Envision a scenario where a set of tiny, lightweight sensors is distributed in a hazardous area (e.g., an ocean, high mountains, or space) to monitor the environment. The sensors are connected to one or multiple central servers which have the task of tracking the measurements of the sensors, i.e., the servers have to compute a function of the sensor values at every point in time. This task is easy to solve as long as the sensors continuously send their current measurements to the servers and the latter have enough memory and computational power to process the sensor data at every point in time. Realistic applications, however, require a huge number of sensors (e.g., because the area is very large, sensors are error-prone, or sensors have only a limited battery lifetime) that cannot be handled by modern server hardware, or the number of required servers might be prohibitively expensive. Additionally, sending the measured data of the sensors continuously to the servers leads to a rapid depletion of the sensors' batteries. Therefore, to build a feasible system, the communication between sensors and servers needs to be severely reduced.

We consider two types of randomized algorithmic approaches to reduce the communication. The first approach is based on Monte Carlo algorithms: sensors decide randomly whether to communicate their currently observed value to the server. The probability of sending a message depends on the significance of the currently observed value: if its impact on the output function is small, the probability of sending a message is low; if the impact is high, so is the probability of sending a message. Thus, the server is not aware of all changes in the sensor values, but with high probability it gets to know all significant changes. With this approach, the server is able to compute a correct output with high probability. In some scenarios (for instance in safety-critical systems), the application demands that the output always be correct. Here, we exploit the idea of Las Vegas algorithms, which reduce the number of sent messages with high probability but always compute the exact output. With low probability many messages may be sent, but the server can always be sure to compute the correct output. All in all, these two approaches offer a trade-off between reducing the communication and computing correct outputs, and the randomization helps to keep this trade-off small.

Considering the scenario described above, we are interested in multiple problems. In the Top-k-Value Monitoring problem, the server is interested in the k largest values observed by the sensors at any time. In contrast, the Top-k-Position Monitoring problem tackles the case where the server is interested in the actual sensors measuring the k largest values, e.g., to track whether large values and the set of sensors observing them are correlated. Since in many cases a rough estimate of the top-k positions is sufficient, we also address the Approximate Top-k-Position Monitoring problem. Besides the largest values, the server might also be interested in how many different values are observed, to get an overview of the global situation. This is captured by the (Approximate) Count Distinct Monitoring problem.

The aforementioned problems have in common that, in practice, a lot of communication can be avoided compared to a naïve approach that gathers all sensor data at the server at every time step. For example, consider the (Approximate) Count Distinct Monitoring problem. If a subset of the sensors observes identical values, not all of them need to communicate their observation to the server; see Fig. 1 for a depiction. There, a horizontal block indicates a set of sensors observing the same value at the same time. Optimally, only one of them would need to communicate the value to the server. Additionally, if the value observed by a fixed sensor does not change significantly over time, the sensor does not need to notify the server at every step. Furthermore, observations of sensors that are not of interest should not be communicated at all. This can be seen when considering the Top-k-Value Monitoring problem: ideally, all sensors not observing one of the k largest values would not communicate at all.

Fig. 1. We consider a central server that is connected to a set of sensor nodes. As time progresses, each sensor observes a sequence of values (indicated by the dots below the sensors). Among other cases, communication can be avoided if a group of sensors observes the same value at the same time (horizontal blocks).

Note that this requires all sensors to be able to receive information from the server. In our model, we therefore allow the server to use a cheap broadcast channel. This is a reasonable assumption, as the central server, in contrast to the sensors, has no need to reduce its power consumption.

In this paper, we examine how communication can be minimized for the problems above. Our focus is in particular on a theoretical analysis of techniques that capture the idea that sensor data might not change arbitrarily between consecutive time steps. We examine, among other things, how to use dynamic data structures and restrictions on the adversary dictating the inputs at the sensor nodes, such that an algorithm can keep or update an existing solution for more than one time step and thereby reduce the overall communication.

We begin in Sect. 2 with a formal introduction of our model and the problems we consider. We also introduce a major technique called *filters*. Afterwards, we establish computational primitives in Sect. 2.3. In Sect. 3 we deal with the Top-k-Value Monitoring problem, followed by the Top-k-Position Monitoring problem in Sect. 4 and the (Approximate) Count Distinct Monitoring problem in Sect. 5.

This paper surveys results from [1 SPP, 4 SPP, 8 SPP, 9 SPP, 10 SPP]. We only give short sketches of algorithms and proofs; for technical details, we refer to the papers above. A detailed description of the current state of the art is presented in [10 SPP].

# 2 Model

In our setting there are $n$ nodes connected to a single server. The nodes are uniquely identified by IDs from the set $\{1,\ldots,n\}$, and each node $i$ receives a stream of data $(v_i^1, v_i^2, v_i^3, \ldots)$. At time $t$, node $i$ observes $v_i^t \in \mathbb{N}$ and does not know any $v_i^{t'}$ with $t' > t$. The superscript $t$ is omitted if it is clear from the context.

Following the model in [3], we allow that between any two consecutive time steps a *communication protocol* exchanges messages between the server and the nodes. The communication protocol is allowed to use a number of rounds polylogarithmic in $n$ and $\max_{1\le i\le n}(v_i^t)$. Nodes can only send messages to the server; they are able to store a constant number of integers, compare two integers, and perform Bernoulli trials with success probability $2^i/n$ for $i \in \{0,\ldots,\log n\}$. The server can communicate to one node directly or utilize a broadcast channel to send one message to all nodes simultaneously. All communication methods described above incur unit communication cost per message, delivery is instantaneous, and we allow a message at time $t$ to have a size which is logarithmic in $n$ and $\max_{1\le i\le n}(v_i^t)$.

A time step $t$ defines a point in time at which the sensor nodes obtain a new piece of input ($v_i^t$ for node $i$ at time $t$). The protocol consists of multiple (communication) rounds: each sensor node performs local computations and may send a message to the server. The server collects all messages, performs local computations, and may send a message via the broadcast channel to all sensor nodes.

Since all nodes are synchronized, the server can detect if no sensor sends a message, and the sensor nodes can identify if the server did not send a message. Furthermore, the server has unrestricted capacity when receiving, i.e., it can always receive all messages that are sent to it.

At the end of each time step, when the communication protocol has terminated, the server decides on the output of the function for the current time $t$, and the whole network proceeds to the next time step $t+1$.

We assume that all observed values are pairwise distinct for the (Approximate) Top-k-Value and Top-k-Position Monitoring problems, and cope with a large number of duplicates when considering the (Approximate) Count Distinct Monitoring problem.

## 2.1 Problems

Our focus here is on three problems: the Top-k-Value Monitoring problem, the Top-k-Position Monitoring problem, and the Count Distinct Monitoring problem. In the Top-k-Value Monitoring problem, we are interested in the largest observed values, i.e., the ordering of the values is of special interest. Let $s_1^t,\ldots,s_n^t$ be the values observed at time $t$ (i.e., $v_1^t,\ldots,v_n^t$) sorted in descending order.

Definition 1 (Top-k-Value Monitoring). *In the Top-$k$-Value Monitoring problem, the server has to output $s_1^t,\ldots,s_k^t$, $k \le n$, at each time $t$.*

Instead of keeping track of the values, it might be of greater interest to keep track of the nodes observing the largest values (for instance in safety-critical applications). This is considered in the Top-k-Position Monitoring problem.

Definition 2 ((Approximate) Top-k-Position Monitoring). *In the Top-$k$-Position Monitoring problem, the server has to output at each time $t$ the $k$ nodes observing $s_1^t,\ldots,s_k^t$ – called the top-$k$. If we are interested in an approximation, we need some more notation. For any constant $\varepsilon \in (0,1)$, let $E(t) := \{\, i \mid v_i^t \in ((1-\varepsilon)^{-1}\, s_k^t, \infty) \,\}$ be the set of nodes observing values which are significantly larger than the $k$th largest one. In the Approximate Top-$k$-Position Monitoring problem, at each time $t$ the server has to output $E(t)$ and $k - |E(t)|$ many nodes not in $E(t)$ observing a value which is at least $(1-\varepsilon)\, s_k^t$.*

In the case that multiple nodes observe the same value, one might rather be interested in how many different values are observed. We approach this direction with the Count Distinct Monitoring problem. Note that we do not assume all values to be distinct when discussing this problem.

Definition 3 ((Approximate) Count Distinct Monitoring). *For a fixed time step $t$, let $d^t$ be the number of distinct values observed by all nodes, i.e., $d^t = |\{v_i^t \mid i \in \{1,\ldots,n\}\}|$. At each time step $t$ the server has to output $d^t$. In the approximation variant, the server has to output an $(\varepsilon,\delta)$-approximation at each time step $t$, i.e., for two constants $0 \le \varepsilon, \delta \le 1$, the server has to compute a value $x \in [(1-\varepsilon)\cdot d^t, (1+\varepsilon)\cdot d^t]$ with probability at least $1-\delta$.*

Explicitly, the values at times $t' < t$ do not matter for the output at time $t$.

## 2.2 Filter-Based Algorithms

One of our main techniques is the usage of *filters*. A filter defines for each sensor an interval of values that do not influence the output function. Filtering the input of an algorithm occurs in many different contexts. In algorithm engineering, filtering has turned out to be a valuable tool to decrease the input size and thereby speed up the computation in certain cases. For instance, the *Filter-Kruskal* algorithm can accelerate the computation of minimum spanning trees of graphs [12]. It improves the *qKruskal* algorithm, which combines the original *Kruskal* algorithm with the partitioning idea of *QuickSort*: the edges are not sorted beforehand; instead, a pivot edge is chosen, the problem is solved recursively on all edges with smaller weight and afterwards (provided the spanning tree is still incomplete) on all edges of larger weight. *Filter-Kruskal* improves on this by not using all edges of larger weight as input for the second recursive call, but only those edges which actually connect two different components of the graph, i.e., it filters out all edges that cannot be part of the minimum spanning tree (see the sketch below). This idea has later been applied to other problems as well, e.g., graph matching [11].
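The filtering step can be made concrete with a small sketch; this is a minimal illustration of the idea rather than the published algorithm (the threshold, the pivot selection, and the base-case handling are illustrative assumptions).

```python
# Minimal Filter-Kruskal sketch: QuickSort-style partitioning plus filtering of
# heavy edges whose endpoints are already connected.
import random

class DSU:                                   # union-find over the vertices
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True

def kruskal(edges, dsu, mst):                # plain Kruskal on a small edge set
    for w, u, v in sorted(edges):
        if dsu.union(u, v):
            mst.append((w, u, v))

def filter_kruskal(edges, dsu, mst, threshold=32):
    if len(edges) <= threshold:
        kruskal(edges, dsu, mst)
        return
    pivot = random.choice(edges)[0]
    light = [e for e in edges if e[0] <= pivot]
    heavy = [e for e in edges if e[0] > pivot]
    if not heavy:                            # all weights equal: avoid endless recursion
        kruskal(edges, dsu, mst)
        return
    filter_kruskal(light, dsu, mst, threshold)
    # filtering step: heavy edges whose endpoints are already connected can
    # never enter the minimum spanning tree and are discarded
    heavy = [(w, u, v) for (w, u, v) in heavy if dsu.find(u) != dsu.find(v)]
    filter_kruskal(heavy, dsu, mst, threshold)
```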

Another filtering approach with numerous applications is *Kalman Filtering*, also known as *Linear Quadratic Estimation*. Its goal is to predict the state of a system based on observations containing inaccuracies. It works in two steps: first, the system parameters are predicted; afterwards, the predictions are updated as soon as the next observation (a measurement with inaccuracies) arrives, using a stochastically weighted average. Applications of Kalman Filtering can be found in various areas, among others navigation control of vehicles, robot motion planning, and signal processing. It has also proven to be a valuable tool for data stream analysis. Similar to our goal, Kalman Filtering is used in [5] to reduce the communication in a sensor-server architecture. Here, Kalman Filtering is applied on both the server and the sensor side (the sensors provide a data stream for the server). As long as the sensor observes values that are within a small deviation of its current prediction, the sensor does not communicate with the server. Once the deviation exceeds a certain threshold, the sensor updates the server.
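A minimal one-dimensional sketch of this idea, assuming a random-walk state model; the noise parameters, the threshold, and the reporting rule are illustrative assumptions and not the scheme of [5].

```python
# 1-D Kalman filter with a "report only on large deviation" rule.
class Kalman1D:
    def __init__(self, x0, p0, process_var, meas_var):
        self.x, self.p = x0, p0                  # state estimate and its variance
        self.q, self.r = process_var, meas_var
    def predict(self):
        self.p += self.q                         # state assumed constant, uncertainty grows
        return self.x
    def update(self, z):
        k = self.p / (self.p + self.r)           # Kalman gain
        self.x += k * (z - self.x)
        self.p *= 1 - k
        return self.x

def sensor_step(kf, measurement, threshold):
    prediction = kf.predict()
    kf.update(measurement)
    # notify the server only when the measurement deviates too much from the
    # prediction that the server can reproduce on its own
    return abs(measurement - prediction) > threshold
```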

Next, we introduce the formal notion of filters and necessary definitions for our model. A set of filters is a collection of intervals, one assigned to each node such that, as long as the observed values at each node are within the given interval, the value of the output function does not change.

Definition 4. *For a fixed time $t$, a* set of filters *is defined by an $n$-tuple of intervals $(F_1^t,\ldots,F_n^t)$, $F_i^t \subseteq \mathbb{N} \cup \{-\infty,\infty\}$ with $v_i^t \in F_i^t$, such that, as long as the value of node $i$ only changes within its interval, i.e., it holds that $v_i^{t'} \in F_i^{t'} = F_i^t$ for $t' \ge t$, the value of the output function does not change. We use $F_i^t = [\ell_i^t, u_i^t]$ to denote the lower and upper bound of a filter interval, respectively.*

We assume that nodes are assigned filters by the server. If a node *violates* its filter, i.e., the currently observed value is not contained in its filter, the node may report the violation and its current value to the server. The server then computes a new set of filters and sends them to the affected nodes. To calculate a set of filters that works for the entire set of nodes, the server may need to probe some more nodes before sending out the new filters. At the end of each time step, no node is allowed to violate its filter. An algorithm following this approach is called *filter-based*.
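A small sketch of these filter mechanics; the server-side policy `recompute_filters` is a hypothetical placeholder for a concrete algorithm (such as the ones in the following sections).

```python
# Filter mechanics: nodes report violations, the server reassigns intervals.
import math

class Node:
    def __init__(self):
        self.filter = (-math.inf, math.inf)      # interval assigned by the server
        self.value = None

    def observe(self, value):
        """Store the new value and report True iff it violates the filter."""
        self.value = value
        lower, upper = self.filter
        return not (lower <= value <= upper)

def time_step(nodes, new_values, recompute_filters):
    violators = [node for node, v in zip(nodes, new_values) if node.observe(v)]
    if violators:                                # server recomputes and sends new filters
        for node, interval in recompute_filters(violators).items():
            node.filter = interval
```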

The easiest way of defining a set of filters is to assign to each node the degenerate interval consisting only of its currently observed value. In this case the usage of filters is not very beneficial, so we are looking for filters that are as large as possible in order to minimize the number of filter changes, which is directly related to the number of exchanged messages.

Our analysis is based on the classical competitiveness approach first used in [13] and later formalized in [6]; see also [2] for an overview. We compare the communication volume of our algorithms to that of an appropriately defined offline algorithm. In our model, a general offline algorithm knows all the input streams in advance and can trivially solve the aforementioned problems without any communication. To still obtain meaningful results regarding the quality of our algorithms, we assume that the optimal offline algorithm OPT uses filters assigned by the server to the nodes. To lower bound the cost of OPT, we count the number of filter updates over time.

Definition 5 (Competitive Algorithms). *We call a (randomized) online algorithm* ALG c-competitive *if for every instance its (expected) communication volume is by a factor of at most* c *larger than the communication volume of* OPT*.*

## 2.3 Computational Primitives

This section is dedicated to three subroutines that will be used in later algorithms. Due to space constraints, we will use these protocols mainly as black boxes; see the cited literature for more details. The first subroutine is a protocol for the Existence problem. In this problem, all nodes observe binary values, i.e., $\forall i \in \{1,\ldots,n\}: v_i \in \{0,1\}$, and the goal of the server is to output their *logical disjunction*.

The Existence Protocol solves this problem in $\log(n)+1$ rounds. In each round $r = 0, 1, \ldots, \log n$, all nodes that have observed the value 1 send a message to the server with probability $2^r/n$. As soon as the first message reaches the server, the protocol ends (at the latest when $r = \log(n)$, where the sending probability reaches 1).

Theorem 1 (Existence). *[9 SPP] There exists an algorithm,* Existence Protocol*, which uses $O(1)$ messages in expectation and at most $\log n + 1$ communication rounds to solve the problem* Existence*.*
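A direct simulation of the protocol as described above; for values of $n$ that are not powers of two, the sending probability is capped at 1 in the last round (an illustrative assumption).

```python
# Simulation of the Existence Protocol; the message count only illustrates the
# O(1) expected bound.
import math, random

def existence_protocol(bits):
    """bits[i] in {0, 1} is the value observed by node i; returns (OR, #messages)."""
    n = len(bits)
    messages = 0
    rounds = math.ceil(math.log2(n)) + 1 if n > 1 else 1
    for r in range(rounds):
        p = min(1.0, 2 ** r / n)                 # sending probability in round r
        senders = [i for i, b in enumerate(bits) if b == 1 and random.random() < p]
        messages += len(senders)
        if senders:                              # the server hears a node: output 1
            return 1, messages
    return 0, messages                           # nobody ever sent: output 0
```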

The Existence Protocol has several applications. Most important for our purposes is the detection of filter violations: the server can detect a filter violation using only a constant number of messages in expectation.

Corollary 1 (Filter Violation). *[9 SPP] There is a protocol, based on the* Existence Protocol*, which uses $O(1)$ messages in expectation to identify a filter violation. In case there are multiple filter violations, one is drawn uniformly at random. If no filter violation occurs, no message is sent.*

Additionally, the Top-k Protocol solves the Top-k-Value Monitoring problem for a single time step. The protocol uses ideas similar to the Existence Protocol: nodes draw a height from a geometric distribution and a tree-like structure is built. Initially, $s_1$ is determined by repeatedly collecting a sample of values and broadcasting the largest one until $s_1$ is identified. The same idea is then used to find $s_2,\ldots,s_k$.

Theorem 2 (Top-k). *[4 SPP] The* Top-k Protocol *uses $k+\log(n)+2$ messages in expectation and an expected number of $O(k+\log n)$ rounds to solve the Top-$k$-Value Monitoring problem.*

# 3 Top-*k*-Value Monitoring

In this section we consider problems regarding the k largest values at the current time step t. We design and analyze Las Vegas algorithms, i.e., we always output the correct values and can show that the total communication and number of rounds are polylogarithmic with high probability.

Consider a general input for the Top-k problem over time. The values might change over time as well as the nodes holding the Top-k values. As a consequence, in a worst-case situation we cannot reuse any information from previous time steps and need to recompute the output from scratch. To counteract this possibility, we consider two different approaches.

In our first approach, we restrict the number of values which can change between queries, and parameterize the result in this number. We show that we can build up a data structure which preserves important information as long as there are not too many updates. This makes answering queries much more efficient, as we use the data structure to quickly reduce the number of candidate nodes which potentially hold the desired result.

In our second approach, we consider filter-based algorithms for the problem. These algorithms have the advantages discussed in Sect. 2.2, i.e., they are very effective if the changes in the output are not too large. To conduct a meaningful worst-case analysis, we consider the competitiveness of the algorithms against a filter-based offline algorithm.

Before we give the details of our solutions, we briefly note that our protocols which compute the Top-k from scratch are essentially optimal with respect to the amount of communication. Intuitively, an algorithm cannot do much better than performing a binary search on $n$ values: it can always ask a set of nodes for their values and then broadcast the maximum to 'eliminate' all nodes with smaller values from the process. Formally, Yao's minimax principle can be applied with a random permutation as input. Each input occurs with probability $1/n!$, and it can be shown that any deterministic algorithm needs at least $\Omega(\log n)$ messages in expectation, which yields:

Theorem 3 ([8 SPP]). *Every comparison-based randomized algorithm requires at least $\Omega(\log n)$ messages in expectation to compute the maximum in our model.*

## 3.1 Dynamic Distributed Data Structure

In this section we consider a data structure for the rank-related problems Top-k and k-Select. The k-Select problem asks to identify the data item of rank $k$. We consider the approximate version, where we have to output an item with rank in $[(1-\varepsilon)k, (1+\varepsilon)k]$ with probability at least $1-\delta$. An approximate version with weaker conditions will also help us to solve the Top-k problem. For the bounds on communication, we consider the following setting: the output is only determined when there is a query for Top-k or k-Select. We allow the parameters to differ from query to query and, furthermore, we allow multiple k-Select queries in the same time step.

Our results are based on the idea of maintaining a (distributed) data structure which is used to answer a query and is informed about each update. More precisely, at every point in time, the data structure keeps track of an approximation of a data item with rank k. These approximations can be exploited by the protocols for a Top-k or k-Select computation to significantly decrease the communication and, interestingly, also the time bounds, rendering this approach very powerful.

The data structure supports the following operations: Top-k: output $\{s_1^t,\ldots,s_k^t\}$; StrongSelect: output $d \in \{s_{(1-\varepsilon)k}^t,\ldots,s_{(1+\varepsilon)k}^t\}$; and WeakSelect: output $d$ with $s_{k\cdot\log^{c_1} n}^t \le d \le s_{k\cdot\log^{c_2} n}^t$ for constants $c_1, c_2 > 1$. The Top-k and StrongSelect operations answer queries for the Top-k and k-Select problems, while the WeakSelect operation supports the other two. Our data structure guarantees the following:

Theorem 4 ([4 SPP]). *There is a distributed data structure with an expected amortized total communication cost of $O(1/\operatorname{polylog} n)$ per update. The amortized number of rounds for an update is $O(1)$. The data structure is able to answer a $k$-Select query correctly with probability at least $1-\delta$; for that, $O(1/\varepsilon^2 \log 1/\delta + (\log\log n)^2)$ messages and $O(\log\log \frac{n}{k})$ rounds are required in expectation. Additionally, the expected total communication cost to answer a Top-$k$ query is $O(k+\log\log n)$ and the expected number of rounds is $O(\log\log n)$. The output is always correct.*

Our data structure is designed as follows. We maintain a *Sketch(t)* about the data items received at time $t$ at the server. The task of such a sketch is to maintain items so that WeakSelect queries can be answered instantly. A *Sketch(t)* is a subset of data items denoted by $\{r_1^t,\ldots,r_m^t\}$, where $m \le \log n$. We call *Sketch(t)* correct if it consists of a set of data items $\{r_1,\ldots,r_m\}$ such that, for each $k = 1,\ldots,n$, there exists $r_k$ with $s_{k\cdot\log^{c_1} n}^t \le r_k \le s_{k\cdot\log^{c_2} n}^t$. We say the data item $r_k$ is the representative of the set of data items $d$ with $s_{k\cdot\log^{c_1} n} \le d \le s_{k\cdot\log^{c_2} n}$. To answer a WeakSelect query for a specific rank in $[k\cdot\log^{c_1} n, k\cdot\log^{c_2} n]$, we simply output the representative $r_{k+1}$.

Computing a *Sketch* is somewhat expensive, hence we want it to remain valid even after some values have been updated. It is easy to see that for appropriately chosen constants $c_1, c_2$, up to $\log^c n$ values can change without this property being lost. In conclusion, we can achieve the stated performance guarantees by computing a *Sketch* which remains valid for $\log^c n$ updates, after which we recompute it from scratch. The WeakSelect operation simply returns an appropriate element from the *Sketch*.

Now, recall that there is a protocol for Top-k which uses $k+\log(n)+2$ messages and $O(k+\log n)$ rounds in expectation (Theorem 2). These bounds hold when the protocol is executed on $n$ nodes without using any information from previous time steps. We can utilize our *Sketch* in the following way: we execute a WeakSelect operation with input $k$, so that we receive a data item $d$ of value at most $s_{k\cdot\log^{c_2} n}^t$. Then, we execute the Top-k protocol only on nodes which hold a data item larger than $d$, i.e., we execute the protocol on only $O(k \log n)$ nodes instead of $n$, yielding the desired bound. The bound on the StrongSelect operation can be obtained in a similar fashion.

## 3.2 Filter-Based Algorithm

We now turn our attention to filter-based algorithms, which we evaluate in the framework of competitive analysis. We compare the algorithm against an optimal offline algorithm which knows all of the future input in advance. To make this analysis meaningful, it is necessary to also restrict the offline algorithm to a filter-based approach. The important part of the filter-based approach is that the offline algorithm has to communicate a set of valid filters to the nodes. In accordance with Definition 4, this means that the offline algorithm has to communicate at least each time the output changes.

The algorithm works as follows: first, the $k$ largest values are determined using the Top-k Protocol of Theorem 2. Afterwards, the server broadcasts $s_k$ such that all nodes $i$ with $v_i \ge s_k$ set their filter to $F_i := [v_i, v_i]$ and the remaining nodes $i$ with $v_i < s_k$ set theirs to $F_i := [-\infty, s_k]$. Whenever a node with one of the $k$ largest values observes a different value, a filter violation occurs and the node sends a message to the server. Each of the other nodes (those with filters $F_i = [-\infty, s_k]$) that observes a filter violation participates in an execution of the Top-k Protocol (to prevent that every such node sends a message). The server unifies and outputs the $k$ largest values among the nodes without a filter violation from the past time step and the new values of the current time step. This algorithm has the following guarantees.

Theorem 5 ([4 SPP]). *There is an online algorithm which monitors the Top-*k*-Values and is $O(k+\log n)$-competitive against an optimal filter-based offline algorithm.*
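A sketch of the filter assignment used by this algorithm: after the Top-k Protocol has determined the $k$th largest value $s_k$, the server broadcasts it and every node derives its filter interval locally (a minimal illustration; violation handling and the recomputation of $s_k$ are omitted).

```python
# Filter assignment of the filter-based Top-k algorithm.
import math

def assign_filters(values, s_k):
    """values[i] = current value of node i; returns one filter interval per node."""
    filters = {}
    for i, v in enumerate(values):
        if v >= s_k:
            filters[i] = (v, v)                  # top-k node: any change is a violation
        else:
            filters[i] = (-math.inf, s_k)        # silent as long as it stays below s_k
    return filters
```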

# 4 Top-*k*-Position Monitoring

In this section we consider monitoring the IDs of the nodes which observe the Top-k values rather than the values themselves [8 SPP,9 SPP]. The intuitive advantage is that small updates to the values of the nodes holding the Top-k do not necessarily mean a change in the Top-k positions. Hence, in a scenario where there are a lot of small fluctuations in the observed values but the overall ranking of nodes stays the same, we have to utilize much less communication if we monitor the nodes.

We only consider filter-based algorithms in this section. For the general approach as in Sect. 3.1, there is no further benefit from monitoring only the positions, as the entire data structure approach aims at optimizing cases in which only a fraction of nodes observe new values. On the other hand, it directly provides a solution for the positions since nodes can always send their IDs along with their values.

For the filter-based algorithm, we expect less communication due to the reason explained above. In fact, we observe an increase in the competitive ratio for the position monitoring: Under worst-case input sequences, the offline algorithm can gain a greater advantage in comparison to the online algorithm.

Theorem 6 ([10 SPP]). *Let each sensor node observe values from* $\{1,\ldots,\Delta\}$*. There is an online algorithm which monitors the Top-*k*-Positions and has a competitiveness of $O(k+\log n+\log\Delta)$ compared to a filter-based offline algorithm.*

## 4.1 Filter-Based Top-*k*-Position Monitoring

The main observation behind our approach is that for this problem it is sufficient to send only a single value $v$ which separates the Top-k from the remaining nodes, i.e., a value between the $k$th and the $(k+1)$st largest value. Based on this observation, the main task for the online algorithm is to decide where to set the value $v$ which separates the Top-k and the remaining sensor nodes from each other. Since no information about the future is known, and the adversary has no restriction in the process of generating the values that the sensor nodes observe in future time steps, we simply take the median value.

*Top-k Position protocol:* Initially, identify the $k$th and $(k+1)$st largest values and the respective sensor nodes (using the one-shot protocol). As long as the Top-k-Positions do not change, define the bound for the filters as the median value between the $k$th and the $(k+1)$st largest value.

In addition to the execution of the one-shot protocol from Theorem 2, this strategy incurs additional $O(\log\Delta)$ messages in expectation by applying the Existence Protocol from Theorem 1 to identify filter violations. Violations can occur until we have found the correct separation between the $k$th and the $(k+1)$st largest value, which takes at most $\log\Delta$ steps, because by choosing the median value we essentially perform a binary search for the correct value. Note that since the adversary is offline adaptive, it is easy to see that every online algorithm needs at least $\Omega(\log\Delta)$ messages, which translates into an overall lower bound of $\Omega(k+\log n+\log\Delta)$ on the competitiveness of any randomized online algorithm. Hence, the bound in Theorem 6 is asymptotically tight.
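The median rule can be sketched as follows; the choice of interval endpoints is illustrative, and the protocols that determine the $k$th and $(k+1)$st largest values are omitted.

```python
# Median rule of the Top-k Position protocol: one separating value defines all
# filters; each refinement is a binary-search step over the value domain.
import math

def separator(s_k, s_k_plus_1):
    return (s_k + s_k_plus_1) // 2               # median of the current gap

def position_filters(n, top_k_ids, sep):
    """Top-k nodes may move freely above sep, all other nodes freely below it."""
    return {i: ((sep, math.inf) if i in top_k_ids else (-math.inf, sep))
            for i in range(n)}
```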

While this strategy generally performs well under minimal changes of the input values, a lot of communication can occur if, e.g., the nodes holding the $k$th and $(k+1)$st largest values often switch positions while these values are almost the same. In such a situation, it might be sufficient not to keep track of the exact Top-k (e.g., for outdoor temperatures, differences of one degree might not matter). We address this by proposing an algorithm that computes positions for the Approximate Top-k as given by Definition 2.

## 4.2 Filter-Based Approx. Top-*k*-Position Monitoring

In this section we allow the online algorithm to have some errors in its output and compare against an optimal offline algorithm which solves the exact problem. Recall that monitoring the Approximate Top-k-Positions allows (only) the online algorithm to choose nodes as an output which are 'close' to the kth largest value (see Definition 2). Observe that filters are allowed to overlap if we consider the relaxation of the Top-k-Position problem.

We want to make use of the allowed error in the following way: when solving the exact problem, we had to search the value domain for the correct separation between the $k$th and the $(k+1)$st ranked value. Allowing an error means that we only need to find an approximation of this separating value, resulting in a faster search. In fact, if we introduce an additive error (say $M$), it is easy to see that the competitiveness compared to a filter-based offline algorithm which solves the exact problem is reduced from $O(k+\log n+\log\Delta)$ to $O(k+\log n+\log(\Delta-M))$.

However, if we use the standard notion of a multiplicative error, the following disadvantage occurs: if the values we search for are smaller, the range of values which lie within the margin of error also becomes smaller. So, in a way, the criterion for a valid output becomes stricter when dealing with smaller values.

To circumvent this shortcoming, we first apply a binary search strategy on a logarithmic scale which terminates after $\log\log\Delta$ filter violations and stops with the property that the allowed error can only vary within constant factors. Applying the approach of the algorithm from Theorem 6 with an early stopping rule, the following can be achieved:

Theorem 7 ([10 SPP]). *Let each sensor node observe values from* $\{1,\ldots,\Delta\}$*. There is an online algorithm which monitors the Approximate Top-*k*-Positions with a competitiveness of $O(k+\log n+\log\log\Delta+\log 1/\varepsilon)$ compared to a filter-based offline algorithm which monitors the exact Top-*k*-Positions.*

## 4.3 Approximate Offline Algorithm

In this section, we study a variant in which the optimal offline algorithm is also allowed to introduce an error, i.e., both the online and the offline algorithm monitor the Top-k-Positions approximately. It turns out that it is much more challenging for online than for offline algorithms to take advantage of the relaxed conditions for a correct output, resulting in a significantly higher competitive ratio. This fact is formalized in a lower bound of $\Omega(n)$ (for constant $k$) [9 SPP], which is much larger than the previous upper bound of $O(k+\log n+\log\Delta)$ for the exact problem. Intuitively speaking, the online algorithm not only has to choose where to set filters, but also has to choose a subset of nodes on which the output is based, which significantly increases the lower bound:

Theorem 8 ([10 SPP]). *Any filter-based online algorithm which solves the approximate Top-*k*-Position Monitoring problem cannot be better than $\Omega(n+\log\Delta)$-competitive.*

We consider two settings in which we compare against an approximate offline algorithm and design algorithms for the respective settings: in the first, the online algorithm has to solve the problem with the same error ε as the offline algorithm; in the second, the online algorithm is allowed to use 2ε, i.e., twice the error of the offline algorithm.

In the first setting, the online algorithm may use the same error ε as the offline algorithm, which results in a competitiveness of $O(n^2\log\Delta)$ (assuming reasonable values of ε, or simply assuming ε to be constant). Intuitively speaking, in this scenario the online algorithm has to answer two questions at the same time: where to place the bounds of the filter intervals, and which subset of nodes to base the output on.

Theorem 9 ([9 SPP]). *Assuming* ε *is a constant, there is an online algorithm for the approximate Top-*k*-Position Monitoring problem which is $O(n^2\cdot\log\Delta)$-competitive.*

This interplay between the two questions leads to a gap between the lower bound and the upper bound stated above. To reduce the power of the adversary, but still consider the problem of choosing a subset of nodes for the output, we consider an augmented version which allows the online algorithm to use an error of 2ε compared to the error ε of the offline algorithm. The resulting algorithm is $O(n)$-competitive (again under reasonable assumptions on ε and on the relation of $n$ and $\Delta$). In this setting, one can argue about the placement of filters within a constant number of filter violations, so the interplay between the filter placement and the choice of the node subset plays a much smaller role, as expressed in the following:

Theorem 10 ([9 SPP]). *Assuming* ε *is a constant and* $\log\Delta = O(n)$*, there is an online algorithm for the approximate Top-*k*-Position Monitoring problem which uses an error of* 2ε *and is $O(n)$-competitive against an optimal offline algorithm using an error of* ε*.*

# 5 (Approximate) Count Distinct Monitoring

In this section, we consider the Count Distinct Monitoring problem, in which the server is tasked with counting how many different values are observed at the sensors. More specifically, we establish an (ε, δ)-approximation of the number of distinct values $d^t$ at time step $t$. On a high level, our approximation scheme shows how a filter-based approach can be combined with a sampling technique to reduce the required communication. Due to space constraints, we only explain our techniques on a high level; for details we refer to [1 SPP].

The key idea for estimating $d^t$ is to follow a sampling approach on the values (not on the nodes). We create a sample out of all values and use the Existence Protocol (Theorem 1) to identify a representing node for each sampled value, i.e., one node per value of the sample set that observes this value. Then, we monitor the identified representing nodes to keep track of $d^t$ over time. For the monitoring, a filter-based approach is utilized, allowing us to compare the communication volume of our protocol to a minimal filter-based one, as already done in the previous sections.

We are able to achieve an (ε, δ)-approximation that is kept valid for multiple time steps, depending on how much the values change in consecutive time steps (parameterized by σ). Using the filter-based approach, our analysis relates the communication to the number of messages exchanged by an optimal filter-based approach ($R^*$). In total, we arrive at the theorem below.

Theorem 11 ([1 SPP]). *There is an* (ε, δ)*-approximation for the Count Distinct Monitoring problem for* T *time steps that uses* $O\big((\sigma+\delta)\cdot\frac{R^*\,\log n}{d^t}\cdot T\cdot\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\big)$ *messages. Here, the change in the number of nodes observing a fixed value between consecutive time steps is upper bounded by a constant factor* σ ≤ 1/2*, and* $R^*$ *is the minimum number of changes of representatives for a given input.*

The bound stated above is composed of different aspects, reflected by the following factors: a factor $\Theta(1/\varepsilon^2 \log 1/\delta)$ stemming from the sampling approach, a factor $\Theta(\sigma+\delta)$ from the bounded number of domain changes, and the competitiveness $O(\log n \cdot R_v^*)$ of monitoring the representative for one domain, where $R_v^*$ is the number of representatives used by an optimal offline algorithm to monitor that a value $v$ is observed.

The bound of the algorithm can also be expressed as $O(\log n\cdot R_S^*\cdot 1/\varepsilon^2 \log 1/\delta)$, where $R_S^*$ denotes the optimal number of representatives for the sample set $S$ throughout the time period $T$. Furthermore, focusing on the aspect of dynamic algorithms, the bound can also be expressed as $O((\sigma+\delta)\cdot T\cdot 1/\varepsilon^2 \log 1/\delta)$. Note that these are different bounds for the same algorithm; they merely reflect different input sequences more precisely.

## 5.1 Computation for One Time Step

The computation for one time step takes place in two phases. First, a constant-factor approximation of $d^t$ is computed. In the second phase, this constant-factor approximation is used to determine a sufficiently large probability that is broadcast to the sensors, which in turn create a sample out of all observed values that is reported to the server. Based on the size of the sample set and the previously calculated probability, the server can estimate $d^t$ up to a factor of ε with probability at least $1-\delta$.

It is crucial here that we perform a random experiment per value, i.e., all sensors observing the same value should see the same outcome of the random experiment. This can be achieved by a *public coin* [7]. A public coin is a random string consisting of fully unbiased bits that is common to all sensor nodes. It can be implemented by having the same pseudorandom number generator at each sensor, initialized with a common seed that is broadcast by the server at the beginning of each phase of the algorithm. Note that such an approach only increases the communication complexity by an additive constant. A set of sensors (observing the same value) is able to perform a random experiment together by considering the same substring of the public coin (which is predefined by the value the sensors are observing).
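A sketch of a per-value random experiment driven by such a public coin; seeding a local pseudorandom generator with the broadcast seed and the observed value is an illustrative implementation choice, not the construction of [7].

```python
# Per-value public coin: every sensor observing the same value obtains the
# same random outcome without any coordination.
import random

def value_coin(shared_seed, value, probability):
    rng = random.Random(f"{shared_seed}:{value}")    # identical on every sensor
    return rng.random() < probability
```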

For a constant-factor approximation, we first let the sensors draw a random number with the public coin based on a geometric distribution, i.e., we generate a random height $h_v$ for each value $v$. Then the server triggers a communication of the values of largest height by polling the heights from largest to smallest in synchronous rounds. Thereby, for each value that is communicated, the Existence Protocol (cf. Theorem 1) is used to bring the number of communicated messages down to a constant.

After we have a constant-factor approximation, we calculate a probability $p$ which is broadcast to the sensors. With probability $p$, a value is communicated to the server in the second phase. Whether or not a value is communicated is again decided for all sensors observing the value using the public coin. For each of the values selected in this phase, the Existence Protocol is used again (cf. Theorem 1) to identify a representing sensor. Such a representing sensor witnesses that the sampled value is observed. The probability $p$ is chosen with respect to ε, δ, and the constant-factor approximation such that the server can compute an (ε, δ)-approximation based on the number of received values of the second phase.
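A sketch of the sampling phase and the resulting estimate, reusing `value_coin` from the sketch above; the choice of $p$ from the constant-factor approximation and the identification of representatives are omitted.

```python
# Value-level sampling: every distinct value is sampled with probability p via
# the public coin; the server scales the sample size up by 1/p.
def estimate_distinct(observed_values, shared_seed, p):
    sampled = {v for v in set(observed_values) if value_coin(shared_seed, v, p)}
    return len(sampled) / p                      # unbiased estimate of d^t
```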

In the end, most of the communication is due to the probability chosen to obtain a sufficiently large sample of the observed values. Thus, we arrive at the theorem below.

Theorem 12 ([1 SPP]). *There is an* (ε, δ)*-approximation algorithm for the Count Distinct Monitoring problem for one time step using $O(1/\varepsilon^2 \log 1/\delta)$ messages in expectation.*

## 5.2 Monitoring over Multiple Time Steps

In the worst case, values might change arbitrarily between consecutive time steps, and a sensor that was used as a representative for a value might not be of use even after a single round. However, as argued before, one expects in practical scenarios that the values observed at a fixed sensor are similar in consecutive time steps. To analyze the quality of our algorithm with regard to the significance of changes in consecutive time steps, we use a filter-based approach. The idea is to reuse the results of the (relatively) costly computation of one time step for consecutive time steps as long as the values are similar to a certain degree. The filter is implemented by the representing sensors that have been identified, i.e., we compare how many times our protocol has to identify such a representing sensor to how many times an optimal filter-based algorithm has to do so ($R^*$).

Recall that a sample set of values was determined using the public coin. The server keeps track of the sample after an initial (ε, δ)-approximation has been computed. Thereby, any sensor sends a message to the server if it observes a value in the sample that has not been observed previously. Based on such messages, the server estimates how many values are newly observed in total. Similarly, if a representative for a value in the sample stops observing the latter, a new representative is searched for (using the Existence Protocol, cf. Theorem 1), and if none is found, the server estimates how many values have disappeared in total. Since any filter-based algorithm has to communicate at some point when an optimal representative sensor stops observing its value, our result depends on the minimum possible number of such changes $R^*$, as can be seen in Theorem 11.

# 6 Conclusion

In this work we elaborated on models for dynamic input sequences and designed and analyzed algorithms that handle these settings. The respective bounds reflect this by comparing the communication to that of an optimal filter-based algorithm, or by introducing parameters expressing how 'fast' an instance changes from time step to time step. We have also shown that there is an algorithm which properly combines these two techniques.

As a next step, it would be interesting to see how these techniques perform in the presence of sliding windows. The fact that sensors are not capable of storing the entire history of the data stream has an influence on the output quality or on the number of messages the sensors need to send to the server, although these values might not be relevant for the current time step.

Another aspect of the input streams might have a significant impact on communication bounds: one could assume that the streams have additional structure, e.g., that they are generated by some random process and can therefore be expected to produce similar observations at a given sensor node in consecutive rounds. Under such an assumption, we would expect bounds in which the communication complexity is proportional to the ability to predict future observations from past ones.

# References

2. Borodin, A., El-Yaniv, R.: Online Computation and Competitive Analysis. Cambridge University Press, Cambridge (1998)
3. Cormode, G.: The continuous distributed monitoring model. SIGMOD Rec. 42(1), 5–14 (2013). https://doi.org/10.1145/2481528.2481530
5. Jain, A., Chang, E.Y., Wang, Y.: Adaptive stream resource management using Kalman filters. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004, pp. 11–22 (2004). https://doi.org/10.1145/1007568.1007573
6. Karlin, A.R., Manasse, M.S., Rudolph, L., Sleator, D.D.: Competitive snoopy caching. Algorithmica 3(1), 77–119 (1988). https://doi.org/10.1007/BF01762111
7. Kremer, I., Nisan, N., Ron, D.: On randomized one-round communication complexity. Comput. Complex. 8(1), 21–49 (1999). https://doi.org/10.1007/s000370050018


# **Energy-Efficient Scheduling**

Susanne Albers(B)

Technische Universität München, Munich, Germany albers@in.tum.de

**Abstract.** We review algorithmic techniques for energy conservation in processing environments handling big data sets. Firstly, we address dynamic speed scaling, where processors can run at variable speed/frequency. The goal is to use the speed spectrum of the processors so as to minimize energy consumption while providing a desired service. Here we focus on multi-processor platforms with heterogeneous CPUs. Secondly, we examine power-down mechanisms where idle devices can be transitioned into low-power standby and sleep states. We consider power-down mechanisms in massively parallel systems, where the components have to coordinate their active and idle periods. In particular we focus on data centers with homogeneous as well as heterogeneous servers.

**Keywords:** Approximation algorithm · Competitive analysis · Dynamic speed scaling · Homogeneous processors · Online algorithm · Polynomial-time algorithm · Power-down mechanisms · Power-heterogeneous processors

# **1 Introduction**

The processing of big data sets crucially depends on powerful hardware environments. Big data is typically processed in data and computing centers. Nonetheless, today even a single PC can solve problems with data volumes that were considered huge just a few years ago. In addition to speed, energy consumption has become a major concern in computing environments. Information and communications technology (ICT) systems consume a significant amount of energy. Currently personal computers, data centers and communication networks use 5–9% of the total electricity worldwide [10,31,34]. It is anticipated that electricity used by ICT could exceed 20% of the global total by 2030. Data centers consume about 200 terawatt hours per year, which corresponds to 1.5% of the global electricity demand [10,34]. This is more than the energy consumption of many (European) countries.

At their core, powerful hardware environments consist of processing units such as servers, PCs and – at the bottom level – CPUs. These may operate separately and sequentially, but in most cases they form parallel and, in particular, massively parallel systems. Nowadays standard PCs and laptops are equipped with multicore architectures. Moreover, in computing and data centers the available processors are interconnected so that hundreds or thousands of CPUs can work on the same application.

In this chapter we review algorithmic techniques for energy savings in hardware and, in particular, in processor systems. The study of such approaches has received considerable interest over the past 15 years, see e.g., [3,14,21,32] and the references therein. Essentially, there exist two general techniques for energy conservation in processor systems.

(1) *Dynamic speed scaling*: Many modern microprocessors can run at variable speed/frequency. Examples are the Intel SpeedStep and the AMD PowerNow! processors as well as the VIA Technologies LongHaul CPUs and the AsAP 1 chips. The speed changes are implemented at the hardware level and the operating system level. High processor speed implies high performance. However, the higher the speed, the higher the energy consumption. The goal is to use the full speed/frequency spectrum of a processor so as to minimize the overall energy consumption, while providing a certain service.

(2) *Power-down mechanisms*: A well-known technique for energy savings is to transition a given system – such as the display of a desktop, a laptop, or simply a CPU – into a standby or hibernate mode if it has been idle for a while. The design of power-down strategies becomes particularly challenging in multi-processor environments, where the active and idle periods of the components have to be coordinated so that the system can satisfy a desired processing demand.

In dynamic speed scaling, energy is conserved by optimally exploiting the speed spectrum of processors. Power-down mechanisms reduce energy consumption by transitioning idle systems into low-power sleep states. In the following sections we address both of the above techniques, focusing on results that were achieved within our project of the SPP 1736.

# **2 Dynamic Speed Scaling**

Dynamic speed scaling has been studied extensively in the algorithms community. Prior work has considered single-processor environments as well as multi-processor platforms with homogeneous CPUs. In this context a fundamental algorithmic optimization problem was introduced in a seminal paper by Yao, Demers and Shenker [39]. Specifically, we are given a single variable-speed processor. If the processor runs at speed *s*, then the required power is (proportional to) $f(s) = s^{\alpha}$, where α > 1 is a constant. In practice, α is typically a small value in the range [2, 3]. In fact the cube-root rule for CMOS devices states that the speed *s* of a processor is proportional to the cube-root of the power or, equivalently, that power is proportional to $s^3$. Obviously, when considering a time horizon, energy consumption is power integrated over time.

Yao et al. [39] define a deadline-based scheduling problem. We are given a sequence σ = *J*1*,...,Jn* of jobs, where each job *Jj* is specified by a release time *rj*, a deadline *dj* and a work volume *wj*. If a job *Jj* is processed at fixed speed *s*, then it takes *wj/s* time units to complete the job. Preemption of jobs is allowed. The goal is to find a feasible schedule, respecting the deadline constraints, that minimizes the total energy consumption. For simplicity it is assumed that a processor can run at any speed. In particular, there are no upper and lower bounds on the speeds. Also speed changes are instant. Yao et al. [39] prove that the offline variant of the problem, where all jobs are known in advance, is polynomially solvable.

In the online variant of the problem, the jobs are revealed at their release time. At any time a scheduling algorithm has to make a decision without knowledge of any future jobs. Given a job sequence σ, let *A*(σ) denote the energy consumed by *A* on σ and let *OPT*(σ) be the minimum energy consumption required for σ. Online algorithm *A* is called *c*-competitive [38] if there exists a constant *d* such that *A*(σ) ≤ *c* · *OPT*(σ) + *d* holds for every job sequence σ. The constant *d* must be independent of σ. We remark that, for the results presented in this article, the stated competitive ratios hold without an additive constant. Yao et al. [39] devised two elegant online algorithms, called *Average Rate* and *Optimal Available*. They showed that *Average Rate* achieves a competitive ratio of $\alpha^{\alpha} 2^{\alpha-1}$, for any α ≥ 2. Bansal et al. [21] analyzed *Optimal Available* and proved a competitive ratio of $\alpha^{\alpha}$.

Speed scaling on homogeneous parallel processors, considering again deadline-based scheduling, was studied in [6,12,23]. It is assumed that job migration is allowed, i.e. whenever a job is preempted, it may be moved to a different processor. Hence, over time, a job may be executed on various processors as long as the respective processing intervals do not overlap. Albers et al. [6] show that the offline problem can be solved optimally in polynomial time using a combinatorial algorithm. Furthermore they extend the algorithm *Optimal Available* and prove a competitiveness of $\alpha^{\alpha}$. An extension of *Average Rate* attains a competitive ratio of $\alpha^{\alpha} 2^{\alpha-1} + 1$.

#### **2.1 Speed Scaling on Heterogeneous Processors**

In [7 SPP,8 SPP] we present a comprehensive study of dynamic speed scaling in heterogeneous multi-processor environments. This is a very timely problem as data and computing centers typically host a variety of hardware architectures. Prior to our work, Bampis et al. [18] examined a setting where the power functions of all the processors are convex. For the offline problem they devise an algorithm that returns a solution within an additive ε of the optimum and runs in time polynomial in the size of the instance and 1*/*ε. Gupta et al. [29,30] study speed scaling on heterogeneous platforms with the objective to minimize energy and the total flow time of jobs.

In [7 SPP,8 SPP] we focus again on classical deadline-based scheduling and assume that *m* power-heterogeneous processors $P_1,\dots,P_m$ are given. Let $f_p(s)$, 1 ≤ *p* ≤ *m*, be the power function of processor $P_p$, depending on speed *s*. We consider two classes of power functions: general continuous power functions (Sect. 2.2) and standard power functions $f_p(s) = s^{\alpha_p}$ (Sect. 2.3).


We assume that job preemption and migration is allowed. In the following let $t_1 < t_2 < \dots < t_l < t_{l+1}$ be the sorted sequence of all possible different release times and deadlines of jobs. Let $I_i = [t_i, t_{i+1})$, for $i = 1,\dots,l$.

#### **2.2 The Offline Problem with General Power Functions**

In a first step we develop an algorithm for the offline problem that is based on linear programming and applies to a wide family of continuous power functions. Our linear program (LP) formulation is more compact than the configuration LP proposed in [18]. The latter one contains an exponential number of variables and requires the use of the Ellipsoid method, which may not be very efficient in practice. Moreover, the formulation in [18] is solvable only for convex functions.

In order to define our LP, let $s_{LB}$ and $s_{UB}$ be a lower bound and an upper bound on the speed of any processor in an optimal schedule. We could choose $s_{LB} = w_{\min}/\sum_i |I_i|$ and $s_{UB} = \sum_j w_j/\min_i |I_i|$. Given any constant ε > 0, we geometrically discretize the interval $[s_{LB}, s_{UB}]$ and define the set of discrete speeds

$$D = \{ s\_{LB},\; s\_{LB}(1+\varepsilon),\; s\_{LB}(1+\varepsilon)^2,\; \dots,\; s\_{LB}(1+\varepsilon)^k \},$$

where $k = \min\{i \mid s_{LB}(1+\varepsilon)^i \ge s_{UB}\}$. This set contains $O(\frac{1}{\varepsilon}\log(\frac{s_{UB}}{s_{LB}}))$ speed levels.
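For concreteness, the discretized speed set can be computed as follows (a minimal sketch; the helper name and its arguments are our own and not part of [7 SPP,8 SPP]):

```python
# Sketch: geometric discretization of the speed range [s_LB, s_UB].
# All names are illustrative assumptions.
import math

def discrete_speeds(s_lb, s_ub, eps):
    # smallest k with s_lb * (1 + eps)^k >= s_ub
    k = max(0, math.ceil(math.log(s_ub / s_lb, 1 + eps)))
    return [s_lb * (1 + eps) ** i for i in range(k + 1)]

D = discrete_speeds(0.5, 64.0, 0.1)  # O((1/eps) * log(s_UB / s_LB)) speed levels
```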

We consider the wide class of continuous power functions satisfying the following invariant. For any small constant ε > 0, there exists a small value ε′ > 0 such that $f((1+\varepsilon)s) \le (1+\varepsilon')f(s)$ holds for any speed $s \in [s_{LB}, s_{UB}]$. Intuitively, a small increase in the speed does not increase the power function by too much. In the case of standard power functions we have that $\varepsilon' = (1+\varepsilon)^{\alpha} - 1$. Hence ε′ may depend on ε and the power function; it is not necessarily smaller than 1. We first show that there exists a $(1+\varepsilon')$-approximate schedule such that, at any time, every processor uses a speed level that belongs to *D*.

For the definition of our LP, for each interval $I_i$ and each job $J_j$ such that $I_i \subseteq [r_j, d_j)$, we introduce a variable $x_{i,j,p,s}$, which corresponds to the total amount of time that $J_j$ is processed during $I_i$ on processor $P_p$ using speed $s$.

$$\begin{aligned} \min & \sum\_{i,j,p,s} x\_{i,j,p,s} f\_p(s) \\ \text{s.t.} & \sum\_{i,p,s} x\_{i,j,p,s} s \ge w\_j \,\,\forall j \\ & \sum\_{p,s} x\_{i,j,p,s} \le |I\_i| \,\,\forall i, j \\ & \sum\_{j,s} x\_{i,j,p,s} \le |I\_i| \,\,\forall i, p \end{aligned}$$

$$x\_{i,j,p,s} \ge 0 \quad \forall i, j, p, s$$

A solution to the above LP specifies an operation of job $J_j$ on processor $P_p$ with processing time $\sum_s x_{i,j,p,s}$ during interval $I_i$. Hence, for each $I_i$, we obtain an instance of the preemptive open shop problem, which can be solved in polynomial time using the algorithm by Gonzalez and Sahni [28].
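The LP can be assembled mechanically from the quantities defined above. The sketch below (our own illustration; the data layout and the use of scipy.optimize.linprog are assumptions, and the subsequent open-shop step of [28] is omitted) shows one way to do this for small instances:

```python
# Sketch: building and solving the LP of Sect. 2.2 with one variable
# x[i, j, p, s] per interval, alive job, processor, and discrete speed.
import numpy as np
from scipy.optimize import linprog

def solve_lp(intervals, works, power, speeds):
    # intervals: list of (|I_i|, set of alive job indices)
    # works[j]: work volume w_j; power[p](s): power of processor P_p at speed s
    idx = {}  # (i, j, p, s) -> column index
    for i, (_, alive) in enumerate(intervals):
        for j in alive:
            for p in range(len(power)):
                for s in range(len(speeds)):
                    idx[(i, j, p, s)] = len(idx)
    c = np.zeros(len(idx))                       # objective: sum x * f_p(s)
    for (i, j, p, s), col in idx.items():
        c[col] = power[p](speeds[s])
    A, b = [], []
    for j, w in enumerate(works):                # sum_{i,p,s} x * s >= w_j
        row = np.zeros(len(idx))
        for (i, j2, p, s), col in idx.items():
            if j2 == j:
                row[col] = -speeds[s]
        A.append(row); b.append(-w)
    for i, (length, alive) in enumerate(intervals):
        for j in alive:                          # sum_{p,s} x <= |I_i| per (i, j)
            row = np.zeros(len(idx))
            for p in range(len(power)):
                for s in range(len(speeds)):
                    row[idx[(i, j, p, s)]] = 1.0
            A.append(row); b.append(length)
        for p in range(len(power)):              # sum_{j,s} x <= |I_i| per (i, p)
            row = np.zeros(len(idx))
            for j in alive:
                for s in range(len(speeds)):
                    row[idx[(i, j, p, s)]] = 1.0
            A.append(row); b.append(length)
    return linprog(c, A_ub=np.array(A), b_ub=np.array(b))  # x >= 0 by default
```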

**Theorem 1.** *There exists an algorithm that produces a* $(1+\varepsilon')$*-approximate schedule in* $O(\mathrm{poly}(n, m, \frac{1}{\varepsilon}, \log(\frac{s_{UB}}{s_{LB}})))$ *time.*

#### **2.3 The Offline Problem with Standard Power Functions**

In this section we focus on standard power functions $f_p(s) = s^{\alpha_p}$, 1 ≤ *p* ≤ *m*. Such functions were considered by Yao et al. [39]. In fact, most of the literature on dynamic speed scaling focuses on this family of functions. As a main result in [7 SPP,8 SPP] we prove that the offline problem can be solved in polynomial time using a fully combinatorial algorithm that is based on repeated maximum flow computations. In a first step we show that there exists an optimal schedule that exhibits four specific properties. These properties will be essential in the design of our algorithm.

First we demonstrate that for any job $J_j$, 1 ≤ *j* ≤ *n*, the processor speeds at which the job is executed are related through the derivatives of the power functions. More specifically, if $J_j$ is partially executed by processors $P_p$ and $P_q$ with speeds $s_{j,p}$ and $s_{j,q}$, respectively, then $f'_p(s_{j,p}) = f'_q(s_{j,q})$. This follows from the convexity of the power functions when analyzing the energy consumed by $J_j$ on processors $P_p$ and $P_q$. Therefore, for any job $J_j$, let $Q_j = f'_p(s_{j,p})$ be the *hypopower* on processor $P_p$.

**Property 1**: Each job *Jj* is executed with constant hypopower *Qj*.

The next property implies that, at any time, the available jobs with the greatest hypopower are executed.

**Property 2**: For any pair of jobs *Jj,Jk* and *t* ∈ [*rj,dj*)∩[*rk,dk*) such that *Jj* is executed at time *t* and *Jk* is not executed at *t*, it holds that *Qj* ≥ *Qk*.

We assume that the density $\delta_j := w_j/(d_j - r_j)$ of each job $J_j$ satisfies $\delta_j \ge \max_{p,q} (\alpha_p/\alpha_q)^{1/(\alpha_q - 1)}$. Observe that $\delta_j$ is equal to the minimum average speed necessary to complete $J_j$ if no other jobs were present. With the assumption on the job densities we can then show that in an optimal schedule, for each job $J_j$ and processor $P_p$, the speed $s_{j,p}$ is at least 1. This allows us to define an order on the processors. We number the processors $P_1,\dots,P_m$ such that, for any *s* ≥ 1, it holds that $f_1(s) \le \dots \le f_m(s)$. This implies $\alpha_1 \le \dots \le \alpha_m$ and $f'_1(s) \le \dots \le f'_m(s)$. We say that $P_p$ is cheaper than $P_q$ if *p < q*. The next property states that cheap processors execute, in general, jobs with greater hypopower, compared to expensive processors.

**Property 3**: Let *I* be an interval and *Jj,Jk* be any pair of jobs executed by processors *Pp* and *Pq* during *I*, respectively. If *p < q*, then *Qj* ≥ *Qk*.

The final property states that at each time the cheapest processors are occupied.

**Property 4**: For each interval $I_i$, there exists an $m_i$ with $0 \le m_i \le m$ such that $P_1,\dots,P_{m_i}$ are occupied throughout $I_i$ while $P_{m_i+1},\dots,P_m$ are idle.

We proceed with the description of our algorithm. To this end we define problem instances specified by triples $(\mathbb{J}, \mathbb{P}, \mathbb{I})$. Here $\mathbb{J}$ is a set of jobs, $\mathbb{P}$ is a set of processors and $\mathbb{I}$ is a set of disjoint intervals. Initially, $\mathbb{J} = \{J_1,\dots,J_n\}$, $\mathbb{P} = \{P_1,\dots,P_m\}$ and $\mathbb{I} = \{I_1,\dots,I_l\}$. In general, during each $I_i \in \mathbb{I}$, there is a subset $\mathbb{J}(I_i) \subseteq \mathbb{J}$ of *alive* jobs $J_j$ with $I_i \subseteq [r_j, d_j)$ and a subset $\mathbb{P}(I_i) \subseteq \mathbb{P}$ of available processors that are unused throughout $I_i$. Let $n_i = |\mathbb{J}(I_i)|$ and $a_i = |\mathbb{P}(I_i)|$.

Let $S^*$ be an optimal schedule satisfying Properties 1–4. Consider any interval $I_i \in \mathbb{I}$. In Property 4, considering $S^*$, we have $m_i = \min\{n_i, a_i\}$ because the number of used processors cannot exceed the number of available processors or the number of alive jobs. This equation specifies the exact amount of time, say $t_p$, that a processor $P_p \in \mathbb{P}$ is used in $S^*$ as well as the corresponding intervals. The most energy-efficient though not necessarily feasible way to schedule the jobs in $\mathbb{J}$ is to use the same constant hypopower *Q* satisfying

$$\sum\_{p \in \mathbb{P}} t\_p \left( \frac{Q}{\alpha\_p} \right)^{\frac{1}{\alpha\_p - 1}} = \sum\_{J\_j \in \mathbb{J}} w\_j.$$

We assume for simplicity that the value of *Q* satisfying the above equation can be computed with arbitrary precision.
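Since the left-hand side of this equation is increasing in *Q*, the value can, for instance, be approximated by bisection; the following small sketch (our own, with assumed data types, not the implementation of [7 SPP,8 SPP]) illustrates this:

```python
# Sketch: approximate the constant hypopower Q such that
# sum_p t_p * (Q / alpha_p)^(1 / (alpha_p - 1)) equals the total work W.
def hypopower(t, alpha, W, lo=1e-12, hi=1e12, iters=200):
    total = lambda Q: sum(tp * (Q / a) ** (1.0 / (a - 1.0)) for tp, a in zip(t, alpha))
    for _ in range(iters):          # total(Q) is increasing in Q, so bisection applies
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if total(mid) < W else (lo, mid)
    return (lo + hi) / 2.0
```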

If there is a feasible schedule in which all jobs are executed with constant hypopower *Q*, then this schedule is optimal and we are done. As we will explain below, this feasibility problem and the calculation of the corresponding schedule can be solved using a maximum flow computation. If such a feasible schedule does not exist, then $(\mathbb{J}, \mathbb{P}, \mathbb{I})$ can be partitioned into two independent subproblems $(\mathbb{J}_{\ge Q}, \mathbb{P}_{\ge Q}, \mathbb{I})$ and $(\mathbb{J}_{<Q}, \mathbb{P}_{<Q}, \mathbb{I})$. Here $\mathbb{J}_{\ge Q}$ and $\mathbb{J}_{<Q}$ are the subsets of $\mathbb{J}$ that are executed with hypopower at least *Q* and smaller than *Q*, respectively, in the optimal schedule $S^*$. In each interval $I_i \in \mathbb{I}$, Properties 2 and 3 specify the subsets of available processors $\mathbb{P}_{\ge Q}(I_i), \mathbb{P}_{<Q}(I_i) \subseteq \mathbb{P}$ dedicated to the jobs of $\mathbb{J}_{\ge Q}$ and $\mathbb{J}_{<Q}$ that are alive during $I_i$. The jobs of $\mathbb{J}_{\ge Q}$ occupy the cheapest $\min\{a_i, |\mathbb{J}_{\ge Q}(I_i)|\}$ processors during $I_i$, while the jobs of $\mathbb{J}_{<Q}$ use the remaining processors of $\mathbb{P}(I_i)$.

The feasibility of $(\mathbb{J}, \mathbb{P}, \mathbb{I})$ w.r.t. the hypopower *Q* is based on a maximum flow computation in an appropriate network $N(\mathbb{J}, \mathbb{P}, \mathbb{I}, Q)$. Consider an interval $I_i \in \mathbb{I}$ and a processor $P_p \in \mathbb{P}(I_i)$. If $P_p$ runs with hypopower *Q* in $I_i$, then its speed is $s_{i,p} = (Q/\alpha_p)^{1/(\alpha_p - 1)}$. We slightly abuse notation and let $s_{i,p}$ be the speed of the *p*-th cheapest available processor during $I_i$ and $\mathbb{P}(I_i)$ be the set of the $m_i$ cheapest available processors during $I_i$.

In the network, there is a source node $u_0$, a node $u_j$ for each $J_j \in \mathbb{J}$, a node $v_{i,p}$ for each pair of interval $I_i \in \mathbb{I}$ and processor $P_p \in \mathbb{P}(I_i)$, a node $v_i$ for each interval $I_i \in \mathbb{I}$, and a sink node $v_0$. The network contains the arc $(u_0, u_j)$ with capacity $w_j$ for each job $J_j \in \mathbb{J}$, the arc $(u_j, v_{i,p})$ with capacity $(s_{i,p} - s_{i,p+1})|I_i|$ for each interval $I_i$, job $J_j \in \mathbb{J}(I_i)$ and processor $P_p \in \mathbb{P}(I_i)$, the arc $(v_{i,p}, v_i)$ with capacity $p\,(s_{i,p} - s_{i,p+1})|I_i|$ for each interval $I_i \in \mathbb{I}$ and processor $P_p \in \mathbb{P}(I_i)$, as well as the arc $(v_i, v_0)$ with infinite capacity for each $I_i \in \mathbb{I}$. We set $s_{i,m+1} := 0$. This network, depicted in Fig. 1, was also introduced by Federgruen and Groenevelt [25].
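A sketch of this feasibility test, built with networkx (our own illustration; node labels and the input format are assumptions), may look as follows:

```python
# Sketch: feasibility of (J, P, I) w.r.t. hypopower Q via a maximum-flow
# computation in the network N(J, P, I, Q) described above.
import networkx as nx

def is_feasible(works, intervals, Q):
    # works[j]: work volume w_j
    # intervals: list of (|I_i|, set of alive job indices, exponents alpha of the
    #            available processors, ordered from cheapest to most expensive)
    G = nx.DiGraph()
    for j, w in enumerate(works):
        G.add_edge("u0", ("u", j), capacity=w)
    for i, (length, alive, alphas) in enumerate(intervals):
        # speed of the p-th cheapest available processor at hypopower Q
        s = [(Q / a) ** (1.0 / (a - 1.0)) for a in alphas] + [0.0]
        for p in range(len(alphas)):
            cap = (s[p] - s[p + 1]) * length        # assumes non-increasing speeds
            for j in alive:
                G.add_edge(("u", j), ("v", i, p), capacity=cap)
            G.add_edge(("v", i, p), ("v", i), capacity=(p + 1) * cap)
        G.add_edge(("v", i), "v0")                  # missing capacity = infinite
    return nx.maximum_flow_value(G, "u0", "v0") >= sum(works) - 1e-9
```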

If there does not exist a feasible schedule for $(\mathbb{J}, \mathbb{P}, \mathbb{I})$ with hypopower *Q*, then the biseparation into $(\mathbb{J}_{\ge Q}, \mathbb{P}_{\ge Q}, \mathbb{I})$ and $(\mathbb{J}_{<Q}, \mathbb{P}_{<Q}, \mathbb{I})$ is based on the following crucial property. Let $\mathbb{J}' \subseteq \mathbb{J}_{<Q}$ be any subset of jobs. A job $J_j \in \mathbb{J} \setminus \mathbb{J}'$ belongs to $\mathbb{J}_{\ge Q}$ if and only if, in the network $N(\mathbb{J} \setminus \mathbb{J}', \mathbb{P}, \mathbb{I}, Q)$, there exists a minimum $(u_0, v_0)$-cut that does not contain arc $(u_0, u_j)$. This allows us to identify $\mathbb{J}_{\ge Q}$ and $\mathbb{J}_{<Q}$. The technical details are omitted here. In summary, Algorithm 1 shows a pseudocode description of our strategy. The following theorem gives the main result.

**Theorem 2.** *Algorithm 1 generates an optimal schedule and runs in polynomial time* $O(n^4 m)$*.*

#### **2.4 An Online Algorithm**

The online algorithm *Average Rate (AVR)*, proposed by Yao et al. [39] for single-processor speed scaling with power function $f(s) = s^{\alpha}$, works with the concept of job

**Fig. 1.** The flow network

**Algorithm 1:** OPT($\mathbb{J}, \mathbb{P}, \mathbb{I}$)
1. Compute the optimum hypopower *Q* for executing $(\mathbb{J}, \mathbb{P}, \mathbb{I})$;
2. $(\mathbb{J}_{\ge Q}, \mathbb{P}_{\ge Q}, \mathbb{I}), (\mathbb{J}_{<Q}, \mathbb{P}_{<Q}, \mathbb{I})$ ← BISEPARATION($\mathbb{J}, \mathbb{P}, \mathbb{I}, Q$);
3. **if** $\mathbb{J} = \mathbb{J}_{\ge Q}$ **then return** CONSTANTHYPOPOWERSCHEDULE($\mathbb{J}, \mathbb{P}, \mathbb{I}, Q$);
4. **else** $S_{\ge Q}$ ← OPT($\mathbb{J}_{\ge Q}, \mathbb{P}_{\ge Q}, \mathbb{I}$); $S_{<Q}$ ← OPT($\mathbb{J}_{<Q}, \mathbb{P}_{<Q}, \mathbb{I}$); **return** $S_{\ge Q} \cup S_{<Q}$;

densities. Again, the density $\delta_j$ of job $J_j$ is equal to $\delta_j = w_j/(d_j - r_j)$. Recall that this is the minimum average speed necessary to complete the job if no other jobs were present. At any time *t*, the processor speed $s(t)$ is set to the accumulated density of active jobs, i.e. $s(t) = \sum_{j: t \in [r_j, d_j)} \delta_j$. With this speed profile, available jobs are scheduled according to the Earliest Deadline First policy.

In order to generalize *AVR* to the multi-processor setting, we consider a variation of the above single-processor algorithm, which uses the same processor speed at any time but applies a different job selection rule. Assume w.l.o.g. that all release times and deadlines are integers. Moreover, assume that $r_{\min} = \min_{1 \le j \le n} r_j = 0$ and $d_{\max} = \max_{1 \le j \le n} d_j = T$. We partition the time horizon into unit-length intervals $I_t = [t, t+1)$, $0 \le t < T$. For each job $J_j$ with $I_t \subseteq [r_j, d_j)$, the algorithm assigns a work volume of $\delta_j$ to interval $I_t$. Then it produces an arbitrary schedule of the total work assigned to $I_t$ using a fixed speed of $s(t) = \sum_{j: I_t \subseteq [r_j, d_j)} \delta_j$ during the whole $I_t$. This modified algorithm attains the same competitive ratio as the original algorithm *AVR* because both strategies always employ the same speed and consume the same energy.
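A minimal sketch of this modified speed profile (our own; integer release times and deadlines are assumed as above):

```python
# Sketch: Average Rate speed profile over unit-length intervals I_t = [t, t+1).
def avr_speeds(jobs, T):
    # jobs: list of (r_j, d_j, w_j) with integer release times and deadlines
    s = [0.0] * T
    for r, d, w in jobs:
        density = w / (d - r)          # delta_j
        for t in range(r, d):          # I_t contained in [r_j, d_j)
            s[t] += density
    return s                           # s[t] is the speed used throughout I_t

print(avr_speeds([(0, 4, 8.0), (1, 3, 3.0)], T=4))  # [2.0, 3.5, 3.5, 2.0]
```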

Next we turn our attention to the setting with multiple heterogeneous processors. Based on the above algorithm variation, we say that a schedule *S* is an *AVR-schedule* if, for every job $J_j$ and interval $I_t \subseteq [r_j, d_j)$, the total amount of work of $J_j$ executed during $I_t$ on all the processors in *S* is equal to $\delta_j$. We prove that, for each input sequence σ = $J_1,\dots,J_n$, there exists a feasible AVR-schedule $S_{AVR}$ on heterogeneous processors with general power functions, as described in Sect. 2.2, whose energy consumption is at most $\max_p c_p + 1$ times that of the optimum schedule for σ. Here $c_p$ is the competitive ratio of the single-processor *AVR* algorithm when executed on processor $P_p$ with power function $f_p(s)$.

We are ready to describe our algorithm *H-AVR* for heterogeneous processors. The main idea is to generate a (1 + ε)-approximate AVR-schedule using the LP-algorithm described in Sect. 2.2. More specifically, given the assignment of work into intervals implied by the definition of AVR-schedules, for each interval *It* = [*t,t* + 1) we compute an offline (1+ε)-approximate schedule for this subinstance of the heterogeneous speed-scaling problem.

**Theorem 3.** *H-AVR is* $(1 + \varepsilon)(\max_p c_p + 1)$*-competitive for speed scaling with heterogeneous processors, where* $c_p$ *is the competitiveness of the single-processor AVR algorithm when applied to processor* $P_p$ *with general power function* $f_p(s)$*.*

**Corollary 1.** *H-AVR is* $(1 + \varepsilon)(\alpha^{\alpha} 2^{\alpha-1} + 1)$*-competitive for speed scaling with heterogeneous processors having standard power functions.*

#### **2.5 Further Results**

We briefly review work by postdoctoral scientists when they were funded within our project. Article [19] explores dynamic speed scaling, assuming that job preemptions are not allowed. In some applications it might not be feasible, or might be too expensive, to interrupt and later resume the execution of a job. For the setting with a single processor, we develop a polynomial time algorithm achieving an improved approximation guarantee of $(1+\varepsilon)^{\alpha} B_{\alpha}$, where $B_{\alpha}$ is a generalization of the Bell number [19]. For multi-processor environments we develop the first approximation algorithm for the fully power-heterogeneous setting, where each processor $P_p$ has an individual power function $f_p(s) = s^{\alpha_p}$. The performance factor is equal to $B_{\alpha}((1+\varepsilon)(1+w_{\max}/w_{\min}))^{\alpha}$. Here $w_{\max}$ and $w_{\min}$ are the maximum and minimum work volumes of the jobs. Again $\alpha = \max_{1 \le p \le m} \alpha_p$.

In [11] we examine the scenario where jobs must be executed subject to an energy budget. The goal is to maximize the throughput. As a main result we develop polynomial time algorithms based on dynamic programming. In [26] we introduce the new problem of scheduling jobs over scenarios. In [27] we study a dynamic market scheduling problem where an intermediary interacts with an unknown sequence of agents.

# **3 Power-Down Mechanisms in Data Centers**

Power-down strategies for a single device have been investigated by Irani et al. [33] and Augustine et al. [17]. The goal is to minimize the energy consumed in an idle period when the device is not in use. In our work we focus on power-down mechanisms in massively parallel systems and, in particular, data centers.

Energy management is a key issue in data center operations [24]. Electricity costs are a dominant and rapidly growing expense in such centers; about 30–50% of their budget is invested into energy. Surprisingly, the servers of a data center are only utilized 20–40% of the time on average [16,22]. Even when idle but still in the active mode, they consume about half of their peak power. Hence a fruitful approach for energy conservation and capacity management is to transition idle servers into standby and sleep states. Servers have a number of low-power states [1]. However, state transitions, and in particular power-up operations, incur energy/cost. Therefore, dynamically matching the varying demand for computing capacity with the number of active servers is a challenging problem.

#### **3.1 Heterogeneous Servers**

In [4 SPP,5 SPP] we formulate and study an optimization problem that arises in the energy management of data centers, hosting a large number of heterogeneous servers. Each server has an active state and several standby/sleep states with individual power consumption rates. The demand for computing capacity varies over time. Idle servers may be transitioned to low-power modes so as to rightsize the pool of active servers. The goal is to find a state transition schedule for the servers that minimizes the total energy consumed. On a small scale the same problem arises in multi-core architectures with heterogeneous processors on a chip. One has to determine active and idle periods for the cores so as to minimize the consumed energy.

More formally, we define the optimization problem *Dynamic Power Management (DPM)*. A problem instance $I = (\mathbb{S}, \mathbb{D})$ is specified by a set of servers and varying computing demands over a time horizon. Let $\mathbb{S} = \{S_1,\dots,S_m\}$ be a set of *heterogeneous servers*. Each server $S_i$, 1 ≤ *i* ≤ *m*, has an active state as well as one or several standby/sleep states. The states of $S_i$ are denoted by $s_{i,0},\dots,s_{i,\sigma_i}$. Here $s_{i,0}$ is the active state and $s_{i,1},\dots,s_{i,\sigma_i}$ are the low-power states. The modes have individual power consumption rates. Let $r_{i,j}$ be the power consumption rate of $s_{i,j}$, i.e., $r_{i,j}$ energy units are consumed per time unit while $S_i$ resides in $s_{i,j}$. The states are numbered in order of decreasing rates such that $r_{i,0} > \dots > r_{i,\sigma_i} \ge 0$. A server can transition between its states. Let $\Delta_{i,j,j'}$ be the non-negative energy needed to move $S_i$ from state $s_{i,j}$ to state $s_{i,j'}$, for any pair $0 \le j, j' \le \sigma_i$. The transition energies satisfy the triangle inequality, i.e., the energy to move directly from $s_{i,j}$ to $s_{i,j'}$ is upper bounded by that of visiting an intermediate state $s_{i,k}$. Formally, $\Delta_{i,j,j'} \le \Delta_{i,j,k} + \Delta_{i,k,j'}$.

Over a time horizon the computing demands are given by a *demand profile* $\mathbb{D} = (T, D)$. Tuple $T = (t_1,\dots,t_n)$ contains the points in time when the computing demands change. There holds $t_1 < t_2 < \dots < t_n$ so that the time horizon is $[t_1, t_n)$. Tuple $D = (d_1,\dots,d_{n-1})$ specifies the demands. More precisely, $d_k \in \mathbb{N}_0$ servers are required for computing during interval $[t_k, t_{k+1})$, for any 1 ≤ *k* ≤ *n*−1. Thus at least $d_k$ servers must reside in the active state during $[t_k, t_{k+1})$. We have $d_k \le m$, for any 1 ≤ *k* ≤ *n*−1, so that the requirements can be met.

Given $I = (\mathbb{S}, \mathbb{D})$, a *schedule* Σ specifies, for each $S_i$ and any $t \in [t_1, t_n)$, in which state server $S_i$ resides at time *t*. Schedule Σ is *feasible* if during any interval $[t_k, t_{k+1})$ at least $d_k$ servers are in the active state, 1 ≤ *k* ≤ *n*−1. The energy $E(\Sigma)$ incurred by Σ is the total energy consumed by all the *m* servers. Whenever server $S_i$, 1 ≤ *i* ≤ *m*, resides in state $s_{i,j}$ it consumes energy at a rate of $r_{i,j}$. Whenever the server transitions from state $s_{i,j}$ to state $s_{i,j'}$, the incurred energy is $\Delta_{i,j,j'}$. The goal is to find an *optimal schedule*, i.e., a feasible schedule Σ that minimizes $E(\Sigma)$. We assume that initially, immediately before $t_1$, and at time $t_n$ all servers reside in the deepest sleep state, i.e. $S_i$ is in $s_{i,\sigma_i}$, 1 ≤ *i* ≤ *m*.

In DPM the demand for computing capacity is specified by the number of servers needed at any time. In data centers it is common practice that the number of required servers is determined as a function of the current total workload, ignoring specific jobs. DPM focuses on energy conservation instead of individual job placement. Again, in the active state, a processor has a fixed energy consumption rate. We investigate DPM as an offline problem, i.e. the varying computing demands are known in advance. From an algorithmic point of view it is important to explore the tractability and approximability of the problem. The offline setting is also relevant in practice. Data centers usually analyze past workload traces to identify long-term patterns. The findings are used to specify demands in future time windows.

Given a problem instance *I*, we first characterize optimal solutions. Property 1 below implies that there exists an optimal schedule in which a server never changes state while being in low-power mode. Property 2 states that there exists an optimal schedule executing state transitions only when the computing demands change. A server *powers up* if it transitions from a low-power state to the active state (indexed 0). A server *powers down* if it moves from the active state to a low-power state.


Finally we may assume w.l.o.g. that the power-down energies $\Delta_{i,0,j}$ are equal to 0, 1 ≤ *i* ≤ *m* and 1 ≤ *j* ≤ $\sigma_i$. If this is not the case, we can simply fold the power-down energy $\Delta_{i,0,j} > 0$ into the corresponding power-up energy $\Delta_{i,j,0}$.

#### **3.2 Servers with Two States**

In [4 SPP,5 SPP] we first investigate the variant of DPM in which each server *Si* has exactly two states, an active state *si,*<sup>0</sup> and a sleep state *si,*1, 1 ≤ *i* ≤ *m*. As a main result we show that an optimal schedule can be computed in polynomial time using an algorithm that resorts to a min-cost flow computation.

In a first step we argue that we may assume w.l.o.g. that the power consumption rates in the sleep states are equal to 0. If this is not the case and $r_{i,1} > 0$, for some *i*, then we can subtract $r_{i,1}$ from both $r_{i,0}$ and $r_{i,1}$. This changes the energy consumption by a fixed amount of $r_{i,1}(t_n - t_1)$ over the entire time horizon. To simplify notation let $r_i := r_{i,0}$ be the power consumption rate of $S_i$ in the active state, 1 ≤ *i* ≤ *m*. Moreover, let $\Delta_i := \Delta_{i,1,0}$ be the energy needed to transition $S_i$ from the sleep state to the active state.

**Fig. 2.** The component *Ci* for server *Si*

In the following let *I* = (S*,*D) be a given problem instance. We develop an algorithm that computes an optimal schedule. Based on Property 2, we focus on schedules that perform state transitions only at the times of *T*. Given *I*, our strategy constructs a flow network *N*(*I*) that we describe in the next paragraphs.

**Network Components.** Network $N(I)$ contains a *component* $C_i$, for each server $S_i$, 1 ≤ *i* ≤ *m*. Such a component $C_i$, which is depicted in Fig. 2, consists of an *upper path* and a *lower path*. The upper path represents the active state of $S_i$; the lower path models the server's sleep state. The computing demands change at the times $t_1 < \dots < t_n$ in *T*. For any $t_k$, 1 ≤ *k* ≤ *n*, there is a vertex $u_{i,k}$ on the upper path. Vertices $u_{i,k}$ and $u_{i,k+1}$ are connected by a directed edge $(u_{i,k}, u_{i,k+1})$ of cost $r_i(t_{k+1} - t_k)$, 1 ≤ *k* ≤ *n*−1. This cost is equal to the energy consumed if $S_i$ is in the active state during $[t_k, t_{k+1})$. Similarly, for any $t_k$, 1 ≤ *k* ≤ *n*, there is a vertex $l_{i,k}$ on the lower path. In order to ensure that at least $d_k$ servers are in the active state during $[t_k, t_{k+1})$, if *k < n*, we need two auxiliary vertices $l^a_{i,k}$ and $l^b_{i,k}$. These vertices are again connected by directed edges. There is an edge $(l_{i,k}, l^a_{i,k})$, followed by two edges $(l^a_{i,k}, l^b_{i,k})$ and $(l^b_{i,k}, l_{i,k+1})$, for any *k* with 1 ≤ *k* ≤ *n*−1. The cost of each of these edges is 0 because the energy consumption in the sleep state is 0.

The lower and the upper paths are connected by additional edges that model state transitions. Recall that all servers are in the sleep state at times $t_1$ and $t_n$. For any *k* with 1 ≤ *k* ≤ *n*−1, there is a directed edge $(l_{i,k}, u_{i,k})$ of cost $\Delta_i$, representing a power-up operation of $S_i$ at time $t_k$. For any *k* with 1 *< k* ≤ *n*, there is a directed edge $(u_{i,k}, l_{i,k})$ of cost 0, modeling a power-down operation of $S_i$ at time $t_k$. The capacity of each edge of $C_i$ is equal to 1.

**The Entire Network.** In $N(I)$ the components $C_1,\dots,C_m$ are aligned in parallel and connected to a source $a_0$ and a sink $b_0$. The general structure of $N(I)$ is depicted in Fig. 3. There is a directed edge from $a_0$ to $l_{i,1}$ in $C_i$, for any 1 ≤ *i* ≤ *m*. Furthermore, there is a directed edge from $l_{i,n}$ to $b_0$, for any 1 ≤ *i* ≤ *m*. Each of these edges has a cost of 0 and a capacity of 1. Vertex $a_0$ has a supply of *m*, and $b_0$ has a demand of *m*. Hence *m* units of flow must be shipped through $C_1,\dots,C_m$. Since all edges have a capacity of 1, one unit of flow must be routed through each $C_i$, 1 ≤ *i* ≤ *m*. Whenever the unit traverses the upper path, $S_i$ is in the active state. Whenever the unit traverses the lower path, $S_i$ is in the sleep state.

In order to ensure that at least $d_k$ servers are in the active state during $[t_k, t_{k+1})$, 1 ≤ *k* ≤ *n*−1, we introduce additional sources and sinks. Network $N(I)$ has a source $a_k$ and a sink $b_k$ with supply/demand $d_k$, for any 1 ≤ *k* ≤ *n*−1. There is a directed edge from $a_k$ to $l^a_{i,k}$ on the lower path of each $C_i$, 1 ≤ *i* ≤ *m*. Furthermore, there is a directed

**Fig. 3.** The network *N*(*I*)

edge from each $l^b_{i,k}$ to $b_k$, 1 ≤ *i* ≤ *m*. The cost and capacity of each of these edges is equal to 0 and 1, respectively. Since $d_k$ flow units have to be shipped from $a_k$ to $b_k$, there must exist at least $d_k$ components $C_i$ in which the flow unit from $a_0$ to $b_0$ traverses the upper path from $u_{i,k}$ to $u_{i,k+1}$. Hence the corresponding servers are in the active state during $[t_k, t_{k+1})$.

Obviously, any feasible schedule Σ in which state transitions are performed only at the times of *T* corresponds to a feasible flow of cost $E(\Sigma)$ in $N(I)$. Unfortunately, the reverse statement is not true. Since $N(I)$ is a single-commodity flow network, a feasible flow *f* does not necessarily represent a feasible schedule. It may happen that flow shipped out of a source $a_k$ is not routed to $b_k$, 0 ≤ *k* ≤ *n*−1. In particular, flow leaving $a_k$ may be routed to a sink $b_{k'}$, where $k' > k$, or to $b_0$. Observe that in $N(I)$ all edge capacities and supplies/demands are integer values. Hence in $N(I)$ there exists a minimum-cost flow that is integral, i.e., the flow along any edge takes an integer value. Moreover, there exist polynomial time combinatorial algorithms that compute such an integral minimum-cost flow [2]. In [5 SPP,4 SPP] we prove that any feasible integral flow *f* of cost *C* in $N(I)$ can be transformed so that it corresponds to a feasible schedule Σ consuming energy *C*. More specifically, using (non-trivial) flow modification operations, we ensure that each network component $C_i$ ships exactly one flow unit in each interval $[t_k, t_{k+1})$. The transformation takes a polynomial number of steps.
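The network construction itself is mechanical; the sketch below (our own toy illustration with networkx; variable names are assumptions) builds $N(I)$ and computes a minimum-cost flow. As discussed above, the resulting single-commodity flow still has to be transformed before it corresponds to a feasible schedule:

```python
# Sketch: the min-cost flow network N(I) of Sect. 3.2 for servers with two states.
import networkx as nx

def build_network(r, delta, t, d):
    # r[i]: active-state rate of server S_i (sleep rate normalized to 0)
    # delta[i]: power-up energy of S_i; t[0..n-1]: change points; d[0..n-2]: demands
    m, n = len(r), len(t)
    G = nx.DiGraph()
    G.add_node("a0", demand=-m)                    # ships one flow unit per server
    G.add_node("b0", demand=m)
    for k in range(n - 1):
        G.add_node(("a", k), demand=-d[k])         # per-interval demand sources/sinks
        G.add_node(("b", k), demand=d[k])
    for i in range(m):
        G.add_edge("a0", ("l", i, 0), capacity=1, weight=0)
        G.add_edge(("l", i, n - 1), "b0", capacity=1, weight=0)
        for k in range(n - 1):
            dt = t[k + 1] - t[k]
            # upper path: active during [t_k, t_{k+1}), cost r_i * interval length
            G.add_edge(("u", i, k), ("u", i, k + 1), capacity=1, weight=r[i] * dt)
            # lower path with auxiliary vertices l^a and l^b, cost 0
            G.add_edge(("l", i, k), ("la", i, k), capacity=1, weight=0)
            G.add_edge(("la", i, k), ("lb", i, k), capacity=1, weight=0)
            G.add_edge(("lb", i, k), ("l", i, k + 1), capacity=1, weight=0)
            # transitions: powering up costs delta_i, powering down costs 0
            G.add_edge(("l", i, k), ("u", i, k), capacity=1, weight=delta[i])
            G.add_edge(("u", i, k + 1), ("l", i, k + 1), capacity=1, weight=0)
            # demand edges force at least d_k active components in interval k
            G.add_edge(("a", k), ("la", i, k), capacity=1, weight=0)
            G.add_edge(("lb", i, k), ("b", k), capacity=1, weight=0)
    return G

flow = nx.min_cost_flow(build_network([3, 5], [4, 2], [0, 2, 5, 6], [1, 2, 1]))
```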

**Theorem 4.** *Let I be an instance of DPM in which each server has exactly two states. An optimal schedule for I can be computed in polynomial time by a combinatorial algorithm that uses a minimum-cost flow computation.*

#### **3.3 Servers with Multiple States**

In [4 SPP,5 SPP] we also investigate DPM in the general scenario that each server has multiple sleep states. In this case DPM becomes NP-hard. We extend our approach based on flow computations to design an approximation algorithm. More specifically, we develop a second algorithm that works with a more complex network in which each component has several lower paths, representing the various low-power states of a server. Furthermore, we need a second commodity to ensure that computing demands are met. With only a single commodity, flow units could switch between lower paths at no cost, and infeasible schedules would result.

Given a fractional two-commodity minimum-cost flow, our algorithm executes advanced flow rounding and packing procedures. First, by repeatedly traversing components, the algorithm modifies the flow so that it becomes integral on the upper paths. Then flow on the lower paths is packed. The final integral flow allows the construction of a schedule for DPM. Our algorithm achieves an approximation factor of τ, where τ is the number of server types in the problem instance. Specifically, the servers can be partitioned into τ classes such that, within each class, the servers are identical. Of course, the servers of a class are independent and not synchronized. In practice, a data center has a large collection of machines but a relatively small number of different server architectures.

**Theorem 5.** *Let I be an instance of DPM with* τ *server types. A schedule whose energy consumption is at most* τ *times the minimum one for I can be computed in polynomial time based on a min-cost two-commodity flow computation.*

#### **3.4 Homogeneous Servers**

In [9] we investigate another algorithmic problem with the objective of dynamically resizing a data center. Specifically, we resort to a framework that was introduced by Lin, Wierman, Andrew and Thereska [35,37].

Consider a data center with *m* homogeneous servers, each of which has two states, an active state and a sleep state. An optimization is performed over a discrete, finite time horizon consisting of time steps *t* = 1*,...,T*. At any time *t*, 1 ≤ *t* ≤ *T*, a non-negative convex cost function $f_t(\cdot)$ models the operating cost of the data center. More precisely, $f_t(x_t)$ is the incurred cost if $x_t$ servers are in the active state at time *t*, where $0 \le x_t \le m$. This operating cost captures, e.g., the energy cost and the service delay for an incoming workload, depending on the number of active servers.

Furthermore, at any time *t* there is a switching cost, taking into account that the data center may be resized by changing the number of active servers. This switching cost is equal to $\Delta (x_t - x_{t-1})^+$, where Δ is a positive real constant and $(x)^+ = \max(0, x)$. Again we assume that transition cost is incurred when servers are powered up from the sleep state to the active state. A cost of powering down servers may be folded into this cost. The constant Δ incorporates e.g., the energy needed to transition a server from the sleep state to the active state, as well as delays resulting from a migration of data and connections. We assume that at the beginning and at the end of the time horizon all servers are in the sleep state, i.e., $x_0 = x_{T+1} = 0$. The goal is to determine a vector $X = (x_1,\dots,x_T)$, called *schedule*, specifying at any time the number of active servers, that minimizes

$$\sum\_{t=1}^{T} f\_t(x\_t) + \Delta \sum\_{t=1}^{T} (x\_t - x\_{t-1})^+. \tag{1}$$

**Fig. 4.** Construction of the graph

All previous work [13,15,20,35–37] on the data-center optimization problem assumes that the server numbers *xt*, 1 ≤ *t* ≤ *T*, may take fractional values. That is, *xt* may be an arbitrary real number in the range [0*,m*]. From a practical point of view this is acceptable because a data center has a large number of machines. Nonetheless, from an algorithmic and optimization perspective, the proposed algorithms do not compute feasible solutions. Important questions remain if the *xt* are indeed integer valued: (1) Can optimal solutions be computed in polynomial time? (2) What is the best competitive ratio achievable by online algorithms?

In [9] we present the first study of the above data-center optimization problem assuming that the *xt* take integer values. In a first step we examine the offline variant of the problem, where the convex functions *ft*, 1 ≤ *t* ≤ *T*, are known in advance. Lin et al. [37] developed an algorithm based on a convex program that computes optimal solutions if fractional values *xt* are allowed.

Considering the discrete setting with integer valued $x_t$, we prove that optimal solutions can also be computed in polynomial time. Our algorithm is different from the convex optimization approach by Lin et al. [37]. More precisely, our strategy works with an underlying directed, weighted graph $G = (V, E)$. Let $[k] := \{1, 2,\dots,k\}$ and $[k]_0 := \{0, 1,\dots,k\}$ with $k \in \mathbb{N}$. For each $t \in [T]$ and each $j \in [m]_0$, there is a vertex $v_{t,j}$, representing the state that exactly *j* servers are active at time *t*. Furthermore, there are two vertices $v_{0,0}$ and $v_{T+1,0}$ for the initial and final states $x_0 = 0$ and $x_{T+1} = 0$. For each $t \in \{2,\dots,T\}$ and each pair $j, j' \in [m]_0$, there is a directed edge from $v_{t-1,j}$ to $v_{t,j'}$ having weight $\Delta(j' - j)^+ + f_t(j')$. This edge weight corresponds to the switching cost when changing the number of servers between time *t*−1 and *t* and to the operating cost incurred at time *t*. Hence, with $x_{t-1} = j$ and $x_t = j'$, the edge cost properly represents the cost contribution $f_t(x_t) + \Delta(x_t - x_{t-1})^+$ in the objective function, see (1), at time *t*. Similarly, for *t* = 1 and each $j \in [m]_0$, there is a directed edge from $v_{0,0}$ to $v_{1,j}$ with weight $f_1(j) + \Delta(j)^+$. Finally, for *t* = *T* and each $j \in [m]_0$, there is a directed edge from $v_{T,j}$ to $v_{T+1,0}$ of weight 0. The structure of *G* is depicted in Fig. 4. In the following, for each $j \in [m]_0$, the vertex set $R_j = \{v_{t,j} \mid t \in [T]\}$ is called *row j*.

A path between $v_{0,0}$ and $v_{T+1,0}$ represents a schedule. If the path visits $v_{t,j}$, then $x_t = j$ servers are active at time *t*. The total length (weight) of a path is equal to the cost of the corresponding schedule. An optimal schedule can be determined using a shortest path computation, which takes $O(Tm)$ time in the particular graph *G*. However, this running time is not polynomial because the encoding length of an input instance is linear in *T* and log *m*, in addition to the encoding of the functions $f_t$. In [9] we present a polynomial time algorithm that improves an initial schedule iteratively using binary search. In each iteration the algorithm constructs and uses only a constant number of rows of *G*.
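A naive reading of this graph as a dynamic program (our own sketch, not the algorithm of [9]) illustrates the construction; it relaxes all edges and therefore needs $O(Tm^2)$ time, whereas the structured edge weights admit the faster computations discussed above:

```python
# Sketch: cheapest path from v_{0,0} to v_{T+1,0} in the graph of Fig. 4.
# dp[t][j] = cheapest cost of a path ending in vertex v_{t,j}.
def optimal_cost(fs, m, Delta):
    # fs: list of T operating-cost functions f_t(j); Delta: switching cost constant
    INF = float("inf")
    T = len(fs)
    dp = [[INF] * (m + 1) for _ in range(T + 1)]
    dp[0][0] = 0.0                                  # initial state x_0 = 0
    for t in range(1, T + 1):
        for j in range(m + 1):
            # edge from v_{t-1, jp} to v_{t, j} costs Delta*(j - jp)^+ + f_t(j)
            dp[t][j] = fs[t - 1](j) + min(dp[t - 1][jp] + Delta * max(0, j - jp)
                                          for jp in range(m + 1))
    return min(dp[T])                               # final edge to v_{T+1,0} has weight 0

print(optimal_cost([lambda j: (j - 2) ** 2, lambda j: (j - 1) ** 2], m=4, Delta=1))
```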

**Theorem 6.** *An optimal schedule can be computed in polynomial time* $O(T \log m)$*.*

In [9] we also examine the online variant of the data center optimization problem where the functions *ft*, 1 ≤ *t* ≤ *T*, are revealed over time. We extend an algorithm *Lazy Capacity Provisioning* proposed by Lin et al. [37] and prove that it achieves a competitive ratio of 3. We also show that this is best possible. No deterministic online algorithm can attain a competitive ratio smaller than 3.

# **References**

	- 6. Albers, S., Antoniadis, A., Greiner, G.: On multi-processor speed scaling with migration. J. Comput. Syst. Sci. **81**(7), 1194–1209 (2015). https://doi.org/10.1016/j.jcss.2015. 03.001
	- 9. Albers, S., Quedenfeld, J.: Optimal algorithms for right-sizing data centers. In: Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, SPAA, pp. 363–372 (2018). https://doi.org/10.1145/3210377.3210385
	- 10. Andrae, A.S.G., Edler, T.: On global electricity usage of communication technology: trends to 2030. Challenges **6**(1), 117–157 (2015). https://doi.org/10.3390/challe6010117
	- 11. Angel, E., Bampis, E., Chau, V., Letsios, D.: Throughput maximization for speed scaling with agreeable deadlines. J. Sched. **19**(6), 619–625 (2015). https://doi.org/10.1007/ s10951-015-0452-y
	- 12. Angel, E., Bampis, E., Kacem, F., Letsios, D.: Speed scaling on parallel processors with migration. J. Comb. Optim. **37**(4), 1266–1282 (2018). https://doi.org/10.1007/s10878- 018-0352-0

Symposium on Discrete Algorithms, SODA, Kyoto, Japan, 17–19 January 2012, pp. 1242–1253 (2012). https://doi.org/10.1137/1.9781611973099.98


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **The GENO Software Stack**

Joachim Giesen(B) , Lars Kuehne, and Sören Laue

Friedrich-Schiller-Universität Jena, Jena, Germany {joachim.giesen,lars.kuehne,soeren.laue}@uni-jena.de

**Abstract.** GENO (generic optimization) is a domain specific language for mathematical optimization. The GENO software generates a solver from a specification of an optimization problem class. The optimization problems, that is, their objective function and constraints, are specified in a formal language. The problem specification is then translated into a general normal form. Problems in normal form are then passed on to a general purpose solver. In its iterations, the solver evaluates expressions for the objective function, constraints, and their derivatives. Hence, computing symbolic gradients of linear algebra expressions is an important component of the GENO software stack. The expressions are evaluated on the available hardware platforms including CPUs and GPUs from different vendors. This becomes possible by compiling the expressions into BLAS (Basic Linear Algebra Subroutines) calls that have been optimized for the different hardware platforms by their vendors. The compiler, called autoBLAS, that translates formal linear algebra expressions into optimized BLAS calls is another important component in the GENO software stack. By putting all the components together, the generated solvers are competitive with problem-specific hand-written solvers and orders of magnitude faster than competing approaches that offer comparable ease-of-use. While this article describes the full GENO software stack, its components are also of interest on their own and thus have been made available independently.

**Keywords:** Constrained optimization · Tensor calculus · BLAS

# **1 Introduction**

GENO makes state-of-the-art performance in solving optimization problems easily accessible. Since optimization problems are ubiquitous in science, engineering and economics, it is not surprising that they come in many different flavors. Traditionally, a main distinction is made between discrete and continuous optimization problems. The focus of GENO is on the continuous case. Prominent examples for classes of continuous optimization problems are linear programs (LPs), quadratic programs (QPs), second-order cone programs (SOCPs), and semi-definite programs (SDPs). For these classes, efficient algorithms and well engineered implementations (solvers) have existed for many years. The solvers are typically called from a programming environment. The optimization problems' data are passed to the solver through function calls. It is the responsibility of the programmer to provide the data in the right format, that is, compliant with a standard form for the specific problem class. The burden of reformulating the problems in standard form is alleviated by modeling languages that transform a problem specification into standard form. Popular modeling languages are CVX [12,17] for MATLAB and its Python extension CVXPY [2,13], Pyomo [19,20] for Python, and JuMP [14] which is bound to Julia. These languages take an instance of an optimization problem and transform it into some standard form of an LP, QP, SOCP, or SDP, respectively. The transformed problem is then passed to a solver that expects the standard form. However, the transformation can be computationally inefficient, because the representation in standard form can be large in terms of the problem size. Also, the solver is called from within the programming environment only for the given problem instance. The modeling language plus solver approach has been made deployable in the CVXGEN [31], QPgen [16], and OSQP [5] projects. In these projects code is generated for the specified problem class and not just for one problem instance. However, the problem dimensions need to be fixed and the generated code is optimized only for very small or sparse problems. There also exist implementations of the modeling language plus solver approach that are independent of a specific programming environment. Prominent examples are AMPL [15] and GAMS [8], which are popular in the operations research community.

GENO differs from previous work by a much tighter coupling of the language and the solver. GENO does not transform problem instances but whole problem classes, including constrained problems, into a very general standard form. Since the standard form is independent of any specific problem instance it does not grow for larger instances. Hence, the generated solvers can be used like hand-written solvers. They even reach or surpass the efficiency of hand-written solvers for large dense problems. Typically, they are orders of magnitude faster than state-of-the-art modeling language plus solver approaches.

In this article, which is based on the original publications [24,25,28,29], we describe the full GENO software stack. The tight coupling of modeling language and solver is achieved in GENO by computing symbolic gradients that are evaluated by the solver on the given data of the optimization problem. Hence, an important part of GENO's software stack is a facility for computing derivatives of linear algebra expressions. GENO's modeling language allows the specification of whole classes of optimization problems in terms of the objective function and constraints that are given as vectorized linear algebra expressions. Neither the objective function nor the constraints need to be differentiable. Non-differentiable problems are transformed into constrained, differentiable problems. A general purpose solver for constrained, differentiable problems is then instantiated with the objective function, the constraint functions and their respective gradients. Using vectorized linear algebra allows a direct mapping onto optimized implementations of BLAS (Basic Linear Algebra Subroutines) routines. BLAS and its close relative LAPACK [3] are the de facto standard for the language independent high performance evaluation of linear algebra expressions. Almost all major hardware vendors provide individual BLAS implementations for their particular hardware, including CPUs (AMD Blis [43], Intel MKL [10], Arm Performance Libraries [30]) and GPUs (NVIDIA cuBLAS [11], AMD clBLAS [4]). GENO supports different hardware platforms through the autoBLAS precompiler that translates linear algebra expressions into optimized BLAS library calls for the addressed hardware.

The GENO software stack comprises a modeling language (Sect. 2), a generic solver (Sect. 3), a matrix and tensor calculus (Sect. 4), and an automatic mapping to BLAS (Sect. 5). The latter three components of GENO's software stack are of interest in a broader context than GENO and hence have been made available independently. GENO is available at www.geno-project.org [27].

# **2 Modeling Language**

GENO's modeling language uses a MATLAB-like syntax for specifying optimization problems. MATLAB is a platform for numerical computations using matrices. The advantages of using matrix expressions are two-fold: First, it allows the user to phrase an optimization problem without the need to specify either the number of variables or the number of constraints. Hence, the generated solver is not tied to a specific instance but can handle arbitrary-sized problems. Second, it enables direct mappings to BLAS routines that are much more efficient than the corresponding for-loops.

A specification in GENO has four blocks:


See Fig. 1 for some illustrative examples.


**Fig. 1.** A few optimization problems formulated in the GENO modeling language. The problem on the left is an unconstrained optimization problem that computes the Rayleigh quotient, the problem in the middle is the non-negative least squares problem, and the problem on the right shows an ℓ1-norm minimization problem from the domain of compressed sensing over the unit simplex.

GENO's modeling language also allows the specification of non-smooth optimization problems, for instance, problems that employ the norm1 function, that is, the non-smooth ℓ1-norm. The non-smooth optimization problems that are allowed by GENO can be written as $\min_x \{\max_i f_i(x)\}$ with smooth functions $f_i(x)$ [36], which is a fairly flexible class that accommodates many of the commonly-encountered non-smooth objective functions. All problems within this class can be transformed into constrained, smooth problems of the form

$$\min\_{t, \mathbf{x}} \ t \quad \text{s.t.} \quad f\_i(\mathbf{x}) \le t \;\; \forall i.$$

The transformed problems are then solved by a solver for constrained, smooth optimization problems. Hence, within the GENO software stack only a solver for constrained, smooth optimization problems is needed. In the next section we describe the solver that is implemented in the GENO software stack.
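As a small worked example of this transformation (the example is ours, not taken from the GENO publications), consider minimizing the ℓ∞-norm of a residual, $\min_x \|Ax - b\|_\infty = \min_x \max_i |a_i^\top x - b_i|$. Introducing the auxiliary variable *t* yields the constrained, smooth (indeed linear) problem

$$\min\_{t, \mathbf{x}} \ t \quad \text{s.t.} \quad a\_i^\top \mathbf{x} - b\_i \le t \ \text{ and } \ b\_i - a\_i^\top \mathbf{x} \le t \quad \forall i,$$

which fits the standard form handled by the solver described in the next section.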

# **3 Generic Optimizer**

GENO's generic optimizer employs a solver for unconstrained, smooth optimization problems. This solver is then extended to handle also constraints. The choice for the solver that is implemented within the GENO software stack is motivated by applications in machine learning. Optimization problems in machine learning typically exhibit a few dozen up to a few million variables, and the involved data matrices do not have any special structure and are typically not sparse, that is, at least 10% of the entries are nonzero entries. These properties exclude second-order optimization algorithms and justify our choice to implement a slightly modified version of the L-BFGS-B algorithm [9,44] that can handle smooth optimization problems that have no general constraints, except possibly bound constraints on the variables. It provides a good trade-off between the number of iterations and the complexity per iteration. It also does not assume any structure on the problem data and it is numerically quite robust. On quadratic problems it shares the same convergence guarantees [22,34] as Nesterov's optimal gradient descent method [35] but compared to Nesterov's method it is parameter free, i.e., no parameters need to be tuned or known for the specific problem.

#### **3.1 Solver for Bound-Constrained Smooth Problems**

The solver for bound-constrained, smooth optimization problems combines a standard limited memory quasi-Newton method with a projected gradient path approach. In each iteration, the gradient path is projected onto the box constraints and the quadratic function based on the second-order approximation (L-BFGS) of the Hessian is minimized along this path. All variables that are at their boundaries are fixed and only the remaining free variables are optimized using the second-order approximation. Any solution that is not within the bound constraints is projected back onto the feasible set by a simple min/max operation [32]. Only in rare cases, a projected point does not form a descent direction. In this case, instead of using the projected point, one picks the best point that is still feasible along the ray towards the solution of the quadratic approximation. Then, a line search is performed for satisfying the strong Wolfe conditions [41,42]. This ensures convergence also in the non-convex case. The line search also removes the need for a predefined step length parameter. We use the line search proposed in [33] which we enhance by a backtracking line search in case the solver enters a region where the function is not defined.
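For illustration only, the following snippet solves a tiny bound-constrained least-squares problem with the reference L-BFGS-B implementation exposed by SciPy; GENO ships its own modified variant of the algorithm as described above, so this is merely a stand-in:

```python
# Sketch: a bound-constrained smooth problem solved with L-BFGS-B (SciPy).
import numpy as np
from scipy.optimize import minimize

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)   # smooth objective
g = lambda x: A.T @ (A @ x - b)                # its gradient
res = minimize(f, x0=np.zeros(2), jac=g, method="L-BFGS-B",
               bounds=[(0.0, None), (0.0, None)])  # bound constraints x >= 0
print(res.x)
```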

#### **3.2 Solver for Constrained Smooth Problems**

There are quite a few options for solving smooth, constrained optimization problems. We decided to use the augmented Lagrangian approach [21,38]. It allows us to (re-)use our solver for smooth, unconstrained problems, it is fairly robust, and it does not require tuning any parameters. The augmented Lagrangian method can be used for solving the following general standard form of an abstract constrained optimization problem

$$\min\_{\mathbf{x}} \ f(\mathbf{x}) \quad \text{s.t.} \quad h(\mathbf{x}) = \mathbf{0} \text{ and } \mathbf{g}(\mathbf{x}) \le \mathbf{0},\tag{1}$$

where $\mathbf{x} \in \mathbb{R}^n$, $f\colon \mathbb{R}^n \to \mathbb{R}$, $h\colon \mathbb{R}^n \to \mathbb{R}^m$, and $g\colon \mathbb{R}^n \to \mathbb{R}^p$ are differentiable functions, and the equality and inequality constraints are understood component-wise.

The augmented Lagrangian of Problem (1) is the following function

$$L\_{\rho}(\mathbf{x}, \lambda, \mu) = f(\mathbf{x}) + \frac{\rho}{2} \left\| h(\mathbf{x}) + \frac{\lambda}{\rho} \right\|^2 + \frac{\rho}{2} \left\| \left( g(\mathbf{x}) + \frac{\mu}{\rho} \right)\_+ \right\|^2,$$

where $\lambda \in \mathbb{R}^m$ and $\mu \in \mathbb{R}^p_{\ge 0}$ are the Lagrange multipliers, also known as dual variables, $\rho > 0$ is a constant, $\|\cdot\|$ denotes the Euclidean norm, and $(v)_+$ denotes $\max\{v, 0\}$. The augmented Lagrangian is the standard Lagrangian of Problem (1) augmented by a quadratic penalty term. The quadratic term provides increased stability during the optimization process, which can be seen, for example, in the case that Problem (1) is a linear program.

The Augmented Lagrangian Algorithm 1 runs in iterations. Upon convergence, it will return an approximate solution $x$ to the original problem along with an approximate solution of the Lagrange multipliers for the dual problem. If Problem (1) is convex, then the algorithm returns the globally optimal solution. Otherwise, it returns a local optimum [6]. The update of the penalty parameter $\rho$ can be omitted and the algorithm still converges [6]. However, in practice it is beneficial to increase it depending on the progress in satisfying the constraints [7]. If the infinity norm of the constraint violation decreases by a factor of less than $\tau = 1/2$ in one iteration, then $\rho$ is multiplied by a factor of two.

# **4 Matrix and Tensor Calculus**

The solver at the core of GENO's generic optimizer, an implementation of the L-BFGS-B algorithm for bound-constrained smooth problems, runs in iterations. In each iteration expressions for the objective function and its gradients are evaluated. Within GENO these expressions, especially the gradients, have to be made available to the solver. Expressions for objective functions are given in GENO's modeling language that uses a vectorized notation, that is, a notation that avoids explicit indices.

#### **Algorithm 1.** Augmented Lagrangian Method

**Input:** instance of Problem (1)
**Output:** approximate solution $\mathbf{x} \in \mathbb{R}^n$, $\lambda \in \mathbb{R}^m$, $\mu \in \mathbb{R}^p_{\ge 0}$
**1** initialize $x^0 = 0$, $\lambda^0 = 0$, $\mu^0 = 0$, and $\rho = 1$
**2** **repeat**
**3**   $x^{k+1} := \operatorname{argmin}_x\, L_\rho(x, \lambda^k, \mu^k)$
**4**   $\lambda^{k+1} := \lambda^k + \rho\, h(x^{k+1})$
**5**   $\mu^{k+1} := \bigl(\mu^k + \rho\, g(x^{k+1})\bigr)_+$
**6**   update $\rho$
**7** **until** *convergence*
**8** **return** $x^k, \lambda^k, \mu^k$
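The following compact Python sketch mirrors the structure of Algorithm 1 on a made-up toy problem. It is an illustration under simplifying assumptions (SciPy's stock L-BFGS-B with numerical gradients as the inner solver), not the GENO implementation.

```
import numpy as np
from scipy.optimize import minimize

f = lambda x: x @ x                        # objective
h = lambda x: np.array([x[0] + x[1] - 1])  # equality constraints h(x) = 0
g = lambda x: np.array([-x[0]])            # inequality constraints g(x) <= 0

def L(x, lam, mu, rho):
    # augmented Lagrangian L_rho(x, lambda, mu)
    return (f(x) + rho / 2 * np.sum((h(x) + lam / rho) ** 2)
                 + rho / 2 * np.sum(np.maximum(g(x) + mu / rho, 0) ** 2))

x, lam, mu, rho = np.zeros(2), np.zeros(1), np.zeros(1), 1.0
violation = np.inf
for _ in range(20):                        # 'until convergence' loop, fixed here
    x = minimize(L, x, args=(lam, mu, rho), method="L-BFGS-B").x
    lam = lam + rho * h(x)                 # multiplier updates (lines 4 and 5)
    mu = np.maximum(mu + rho * g(x), 0)
    new_violation = max(np.max(np.abs(h(x))), np.max(np.maximum(g(x), 0)))
    if new_violation > 0.5 * violation:    # update rho (line 6) with tau = 1/2
        rho *= 2
    violation = new_violation
print(x)   # approximately [0.5, 0.5] for this toy problem
```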

The advantage of a vectorized notation is that expressions can be mapped more or less directly to BLAS calls and thus to highly optimized BLAS implementations. For GENO we also want this advantage for the gradients. Hence, we need to compute derivatives of matrix expressions. Although computing derivatives of matrix and tensor expressions is a fundamental and frequent task, surprisingly, no algorithm existed that would solve this problem in the general case. In the following, we describe our approach [24,28] that, for the first time, allowed computing derivatives of general tensor expressions. It was shown in [24] that evaluating derivatives of non-scalar valued functions computed by this approach is two orders of magnitude faster than previous state-of-the-art approaches when evaluated on the CPU and up to three orders of magnitude faster when evaluated on the GPU. An implementation of our approach is integrated into the GENO software stack. It is also available as a standalone tool at www.MatrixCalculus.org [26].

#### **4.1 Problems with Matrix Notation**

Computing derivatives of scalar functions, i.e., $f\colon \mathbb{R} \to \mathbb{R}$, is a straightforward task that is taught already in high school. One just applies the chain rule repeatedly and multiplies the partial derivatives together. For instance, consider the function $f(x) = \sin(x^2)$. Its derivative is $f'(x) = \cos(x^2) \cdot 2x$. However, this no longer works in the matrix and tensor case. Compared to the scalar case, where only one type of multiplication operator exists, there are several types of multiplication in the matrix and tensor case. It has been shown that 24 different types of multiplication are necessary for representing the derivatives of matrix expressions in the linear case alone [37]. Hence, it is essential to find a good representation of matrix and tensor multiplications.

Furthermore, when computing derivatives of vector and matrix expressions, even matrix notation is not sufficient to express all derivatives. For instance, for a function $f\colon \mathbb{R}^n \to \mathbb{R}^m$, the derivative will be a matrix $M \in \mathbb{R}^{m \times n}$. But already its second derivative will be $T \in \mathbb{R}^{m \times n \times n}$, i.e., a third-order tensor, which cannot be represented in standard matrix notation. One usually circumvents this by using the vec-operator, which maps a matrix to a vector by stacking its columns on top of each other, together with the Kronecker product. This way, one can flatten some dimensions and emulate higher-order tensors and their multiplications. However, still not all necessary multiplications can be represented this way, and it unnecessarily complicates the representation. And even in the two-dimensional case, i.e., when the derivative is a tensor of order two, it might have no corresponding representation as a matrix. For instance, consider the simple quadratic function $f(x) = x^{\top} A x$, where $x \in \mathbb{R}^n$ is a vector and $A \in \mathbb{R}^{n \times n}$ is a matrix. When computing the derivative of $f$ with respect to $x$ using the chain rule, one has to compute the derivative of $x^{\top}$ with respect to $x$, i.e., the derivative of the function that maps $x$ to its transpose. This is not the identity matrix. In fact, it is not even representable as a matrix. In the more powerful Ricci notation [39] it would be written as the tensor $\delta_{ij}$. Hence, the right representation of tensors and of the operators on them, especially the multiplications between them, is crucial. In fact, choosing the *right* representation has led to the first general and coherent matrix and tensor calculus theory [24]. Before, only a number of special cases could be treated systematically. While the first theory used Ricci notation to represent tensors and their multiplications, it turned out that using a generalized form of Einstein notation makes the process of computing derivatives even simpler and more coherent [28].
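For orientation, the end result for this quadratic example is of course the familiar one (a standard fact, independent of the calculus in [24,28]); only the intermediate step, the derivative of $x^{\top}$ with respect to $x$, falls outside matrix notation, and it disappears again once all indices have been contracted:

$$\frac{\partial}{\partial x}\bigl(x^{\top} A x\bigr) = (A + A^{\top})\,x, \qquad \frac{\partial^2}{\partial x^2}\bigl(x^{\top} A x\bigr) = A + A^{\top}.$$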

#### **4.2 Einstein Notation**

In tensor calculus one can distinguish three types of multiplication, namely inner, outer, and element-wise multiplication. Indices are used for distinguishing between these types. For tensors *A*,*B*, and *C* any multiplication of *A* and *B* can be written as

$$C[s_3] = \sum_{(s_1 \cup s_2) \setminus s_3} A[s_1] \cdot B[s_2],\tag{2}$$

where $C$ is the result tensor and $s_1$, $s_2$, and $s_3$ are the index sets of the left argument, the right argument, and the result tensor, respectively. The summation is over all indices that appear in at least one of the two multiplication's arguments $A$ and $B$ and are not present in the result tensor $C$. The index set of the result tensor is always a subset of the union of the index sets of the multiplication's arguments, that is, $s_3 \subseteq (s_1 \cup s_2)$. In the following we denote the generic tensor multiplication as defined in Eq. (2) simply as

$$C = A \*\_{(s\_1, s\_2, s\_3)} B.$$

This notation is basically identical to the tensor multiplication einsum in NumPy, TensorFlow, and PyTorch, and to the notation used in the Tensor Comprehension Package [40].

Note that the $*_{(s_1,s_2,s_3)}$-notation comes close to standard Einstein notation. In Einstein notation the index set $s_3$ of the output is omitted and the convention is to sum over all shared indices in $s_1$ and $s_2$. However, this restricts the types of multiplications that can be represented. The set of multiplications that can be represented in standard Einstein notation is a proper subset of the multiplications that can be represented by our notation. For instance, standard Einstein notation is not capable of representing element-wise multiplications directly. Still, in the following we refer to the $*_{(s_1,s_2,s_3)}$ notation simply as Einstein notation, as is standard practice in many linear algebra packages.
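To make the correspondence concrete, the following small NumPy example (our own toy code, not part of GENO) shows how the choice of the index sets $s_1$, $s_2$, and $s_3$ selects inner, outer, and element-wise multiplication via einsum:

```
import numpy as np

A = np.arange(6.0).reshape(2, 3)   # indices i, j
B = np.arange(3.0)                 # index  j

# inner multiplication: s1 = {i, j}, s2 = {j}, s3 = {i}  ->  sum over j
inner = np.einsum("ij,j->i", A, B)          # ordinary matrix-vector product A @ B

# outer multiplication: s1 = {i, j}, s2 = {k}, s3 = {i, j, k}  ->  no summation
outer = np.einsum("ij,k->ijk", A, B)

# element-wise multiplication: s1 = {i, j}, s2 = {j}, s3 = {i, j}
# j is shared but kept in the result, which standard Einstein notation cannot express
elementwise = np.einsum("ij,j->ij", A, B)   # same as A * B (broadcast over rows)

print(inner.shape, outer.shape, elementwise.shape)   # (2,) (2, 3, 3) (2, 3)
```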

#### **4.3 Tensor Calculus**

In the following, let $\|A\| = \bigl(\sum_s A[s]^2\bigr)^{1/2}$ denote the norm of a tensor $A$. For vectors it coincides with the Euclidean norm and for matrices with the Frobenius norm. The following definition generalizes the standard derivative to the multi-dimensional case.

**Definition 1 (Fréchet Derivative).** *Let $f\colon \mathbb{R}^{n_1 \times n_2 \times \ldots \times n_k} \to \mathbb{R}^{m_1 \times m_2 \times \ldots \times m_l}$ be a function that takes an order-$k$ tensor as input and maps it to an order-$l$ tensor as output. Then $D \in \mathbb{R}^{m_1 \times m_2 \times \ldots \times m_l \times n_1 \times n_2 \times \ldots \times n_k}$ is called the derivative of $f$ at $x$ if and only if*

$$\lim_{h \to 0} \frac{\|f(\mathbf{x} + h) - f(\mathbf{x}) - D \circ h\|}{\|h\|} = 0,$$

*where* ◦ *is an inner tensor product.*

Here, the dot product notation $D \circ h$ is short for the inner product $D *_{(s_1 s_2,\, s_2,\, s_1)} h$, where $s_1 s_2$ is the index set of $D$ and $s_2$ is the index set of $h$. For instance, if $D \in \mathbb{R}^{m_1 \times n_1 \times n_2}$ and $h \in \mathbb{R}^{n_1 \times n_2}$, then $s_1 s_2 = \{i, j, k\}$ and $s_2 = \{j, k\}$.

With this definition at hand, we can compute derivatives of matrix and tensor expressions in Einstein notation. As noted at the beginning of this section, derivatives are usually computed using the chain rule. There are two major orderings in which we can apply the chain rule: in a forward fashion and in a reverse fashion. These ways are known as forward mode and reverse mode in the area of algorithmic differentiation (AD, also known as automatic differentiation) [18]. They both result in the same derivative, but not necessarily in the same expression for the derivative. The forward mode coincides with what is usually taught in high school and is commonly referred to as symbolic computation of derivatives [23]. Here, we will only describe the reverse mode, since this is the mode that is used within the GENO software stack.

Any expression can be represented as a directed acyclic expression graph (expression DAG). Figure 2 shows the expression DAG for the objective function of the logistic regression, i.e.,

$$1^\top \left( \mathbf{y} \odot \log(\exp(X\mathbf{w}) + 1) \right),\tag{3}$$

where $\odot$ denotes element-wise multiplication.

**Fig. 2.** Expression DAG for Expression (3).

The nodes of the DAG that have no incoming edges represent the variables or constants of the expression and are referred to as input nodes. The nodes of the DAG that have no outgoing edges represent the functions that the DAG computes and are referred to as output nodes. Let the DAG have $n$ input nodes (variables) and $m$ output nodes (functions). Note that the DAG in Fig. 2 has only one output node. We label the input nodes as $x_0, \ldots, x_{n-1}$, the output nodes as $y_0, \ldots, y_{m-1}$, and the internal nodes as $v_0, \ldots, v_{k-1}$. Every internal and every output node represents an operator whose arguments are supplied by the incoming edges.

When evaluating the DAG, i.e., computing the function values that the DAG represents for some given input, one proceeds from the input nodes to the output nodes. In forward mode automatic differentiation one proceeds in the same direction for computing the derivative, and in reverse mode in reverse order, i.e., from the output to the input nodes. Each node $v_i$ will eventually store the derivative $\frac{\partial y_j}{\partial v_i}$, which is usually denoted as $\bar{v}_i$, where $y_j$ is the function to be differentiated. This partial derivative is often referred to as the adjoint. These derivatives are computed as follows: First, the derivatives $\frac{\partial y_j}{\partial y_i}$ are stored at the output nodes of the DAG. Then, the derivatives that are stored at the remaining nodes, here called $z$, are iteratively computed by summing over all their outgoing edges as follows

$$\overline{z} = \frac{\partial \mathbf{y}\_j}{\partial z} = \sum\_{f:(z,f)\in E} \frac{\partial \mathbf{y}\_j}{\partial f} \cdot \frac{\partial f}{\partial z} = \sum\_{f:(z,f)\in E} \overline{f} \cdot \frac{\partial f}{\partial z},\tag{4}$$

where the multiplication is again tensorial. The following theorems specify the type of tensor multiplication for the reverse mode update in Eq. (4). Their proofs can be found in [29].

**Theorem 1.** *Let $Y$ be an output node with index set $s_4$ and let $C = A *_{(s_1,s_2,s_3)} B$ be a multiplication node of the expression DAG. Then the contribution of $C$ to the adjoint $\bar{B}$ of $B$ is $\bar{C} *_{(s_4 s_3,\, s_1,\, s_4 s_2)} A$ and its contribution to the adjoint $\bar{A}$ of $A$ is $\bar{C} *_{(s_4 s_3,\, s_2,\, s_4 s_1)} B$.*

If the output function $Y$ in Theorem 1 is scalar-valued, then we have $s_4 = \emptyset$ and the adjoint coincides with the function implemented in all modern deep learning frameworks, including TensorFlow and PyTorch. Hence, our approach can be seen as a direct generalization of the scalar case.

**Theorem 2.** *Let $Y$ be an output function with index set $s_3$, let $f$ be a general unary function whose domain has index set $s_1$ and whose range has index set $s_2$, let $A$ be a node in the expression DAG, and let $C = f(A)$. The contribution of the node $C$ to the adjoint $\bar{A}$ is*

$$\bar{C} *_{(s_3 s_2,\, s_2 s_1,\, s_3 s_1)} f'(A),$$

*where $f'$ is the derivative of $f$.*

In the case that the general unary function is an elementwise unary function, i.e., it is applied element-wise to a tensor, Theorem 2 simplifies as follows.

**Theorem 3.** *Let $Y$ be an output function with index set $s_2$, let $f$ be an elementwise unary function, let $A$ be a node in the expression DAG with index set $s_1$, and let $C = f(A)$, where $f$ is applied element-wise. The contribution of the node $C$ to the adjoint $\bar{A}$ is*

$$\bar{C} *_{(s_2 s_1,\, s_1,\, s_2 s_1)} f'(A),$$

*where $f'$ is the derivative of $f$.*

Table 1 shows the individual steps of the reverse mode applied to the expression graph in Fig. 2. Note that the reverse mode manages to compute the derivative of the output function with respect to all input variables in one pass. Again, the last column shows the derivatives in matrix notation after a few simplifications have been applied, like the removal of zero and identity tensors. From the first two rows we can read off the derivative of $f$ with respect to $X$ and the derivative with respect to $w$. The values of the intermediate results and common subexpressions $v_1$ and $v_2$ can be substituted again to obtain the final expression $X^{\top} \cdot \bigl(y \odot \exp(Xw) \oslash (\exp(Xw) + 1)\bigr)$ for the derivative with respect to $w$, where $\oslash$ denotes element-wise division. This expression can then be mapped very easily to a NumPy expression. In the next section, we will discuss how to map such expressions also to different hard- and software backends.
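As a quick plausibility check (our own snippet, independent of the GENO tooling), the gradient expression with respect to $w$ can be evaluated with NumPy and compared against a finite-difference approximation of Expression (3) on random data:

```
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
y = rng.random(5)
w = rng.standard_normal(3)

f = lambda w: np.ones(5) @ (y * np.log(np.exp(X @ w) + 1))   # Expression (3)

# derivative with respect to w, as obtained by the chain rule
grad = X.T @ (y * np.exp(X @ w) / (np.exp(X @ w) + 1))

# finite-difference check
eps = 1e-6
fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.max(np.abs(grad - fd)))   # close to 0
```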

**Table 1.** Individual steps of the reverse mode automatic differentiation of the logistic regression function, i.e., $1^{\top}\bigl(y \odot \log(\exp(Xw) + 1)\bigr)$, with respect to all input variables.


# **5 autoBLAS**

GENO aims at providing state-of-the-art performance on a wide variety of backends, including multicore CPUs and GPUs. Hence, it is necessary to generate efficient code for all these backends. This is the purpose of autoBLAS. GENO does not need to compile the specification of an optimization problem directly into executable code; instead it can map it to an intermediate representation where linear algebra expressions are given as blocks of autoBLAS code. The autoBLAS precompiler then compiles the intermediate code into standard code for the specified backends. autoBLAS itself features an intuitive syntax for linear algebra expressions that is easy to read and comprehend, and delegates the details of their execution to highly efficient implementations of BLAS routines for the respective backends, for instance AMD BLIS [43], Intel MKL [10], Arm Performance Libraries [30], NVIDIA cuBLAS [11], and AMD clBLAS [4].

# **5.1 A Simple autoBLAS Example**

For illustrating autoBLAS, we discuss a minimal example, namely a matrix-vector product. Listing 1.1 shows a snippet of C++ code initializing a set of *std::vector*s that represent vectors and matrices, followed by a pragma-style declaration of an autoBLAS section. The autoBLAS section first declares two vectors *x* and *y*, and a matrix *A*. Each declaration comes with a set of name-value pairs, like data=x.data() or rows=rows, that describe the properties required for generating code to evaluate expressions over the associated variables. The set of supported names and the restrictions on their values are the responsibility of the selected host-language *context*. Here, the C-language context has been chosen by setting c=c. The currently supported contexts are the C language, the Eigen library, the NumPy library, and cuBLAS (CUDA).

**Listing 1.1.** Example embedding autoBLAS within C++

```
int rows = 10;
int cols = 20;
std::vector<double> x(rows);
std::vector<double> A(rows * cols);
std::vector<double> y(cols);
// init y, A, and x with application specific values
#autoblas c=c {
    Vector x data=x.data();
    Vector y data=y.data();
    Matrix A data=A.data()
             rows=rows
             cols=cols;
    y = A' * x;
}
// continue using x, y, and A in C++
```
The code in Listing 1.1 is, of course, not valid C++ code and cannot be compiled directly with a standard C++ compiler. In order to obtain host-language code for the expressions stated in embedded autoBLAS sections, the autoBLAS precompiler has to be invoked first. Listing 1.2 shows how to invoke the autoBLAS precompiler on a file *example.c.in*. In this simple example, the *-b* flag selects the target routines for the host-language mappings, here the standard C binding *cblas*.

**Listing 1.2.** Compiling autoBLAS code

```
$ autoblas -b cblas < example.c.in > example.c
$ gcc example.c -o example
```
In our specific example, the autoBLAS precompiler replaces the autoBLAS section in the host-language file with a call to gemv, which is the BLAS routine that computes matrix-vector products [1]. The generated code, shown in Listing 1.3, is now valid C++ code that can be passed to a conforming compiler like gcc.

**Listing 1.3.** C++ code generated by the autoBLAS precompiler

```
int rows = 10;
int cols = 20;
std::vector<double> x(rows);
std::vector<double> A(rows * cols);
std::vector<double> y(cols);
// init y, A, and x with application specific values
cblas_dgemv(CblasColMajor, CblasTrans, rows, cols, 1.0,
            A.data(), rows, x.data(), 1, 0.0, y.data(), 1);
// continue using x, y, and A in C++
```
If we want to generate code for the CUDA backend, then we just have to invoke the autoBLAS precompiler with `autoblas -b cuda < example.c.in > example.c`. The generated code, shown in Listing 1.4, is now valid C++/CUDA code that again can be passed directly to a conforming compiler.

**Listing 1.4.** C++/CUDA code generated by the autoBLAS precompiler

```
int rows = 10;
int cols = 20;
std::vector<double> x(rows);
std::vector<double> A(rows * cols);
std::vector<double> y(cols);
// init y, A, and x with application specific values
cublasHandle_t handle;
cublasCreate(&handle);
double* d_x; cudaMalloc(&d_x, x.size() * sizeof(double));
cudaMemcpy(d_x, x.data(), x.size() * sizeof(double),
           cudaMemcpyHostToDevice);
double* d_y; cudaMalloc(&d_y, y.size() * sizeof(double));
double* d_A; cudaMalloc(&d_A, A.size() * sizeof(double));
cudaMemcpy(d_A, A.data(), A.size() * sizeof(double),
           cudaMemcpyHostToDevice);
const double alpha = 1.0;
const double beta = 0.0;
cublasDgemv(handle, CUBLAS_OP_T, rows, cols, &alpha, d_A,
            rows, d_x, 1, &beta, d_y, 1);
cudaMemcpy(y.data(), d_y, y.size() * sizeof(double),
           cudaMemcpyDeviceToHost);
cudaFree(d_A);
cudaFree(d_y);
cudaFree(d_x);
cublasDestroy(handle);
// continue using x, y, and A in C++
```
# **5.2 Design**

By defining an embedded language of its own, autoBLAS is as intuitive to use as task-specific frameworks like MATLAB when it comes to expressing *what* to compute. Additionally, by not being bound to a particular programming language, autoBLAS can perform any necessary transformation and optimization on the symbolic level at compile time, even beyond the scope of a single statement. Finally, autoBLAS delegates the task of deciding *how* to evaluate the optimized expressions by generating the corresponding BLAS calls. This allows the user to utilize highly efficient BLAS implementations for the target platform without having to write these calls by hand.

Figure 3 illustrates the three abstract steps of the autoBLAS compiler. The *frontend* is the user-facing part of autoBLAS and comprises both the expression syntax as well as the context selection. The context specifies attributes of the variables like, for instance, their memory layout and the BLAS selections available at compile time.

**Fig. 3.** The design of autoBLAS is divided in three independent components: the user-facing frontend, the optimizing core, and the executing backend.

The *core* implements a set of symbolic optimizations to increase execution performance at runtime, while also allocating and reusing memory for temporaries, if necessary. By performing these syntax-tree optimizations independently of a specific target API, autoBLAS provides a uniform evaluation semantics across different target platforms, thereby minimizing unpleasant surprises like differing operator semantics or optimization behavior when switching between libraries.

The *backend* generates code for the optimized expressions for the respective linear algebra library selected by the caller. Backends define a set of necessary attributes for evaluating expressions into code. For instance, for the cblas [1] backend, a dense matrix is often represented by a data pointer, a storage orientation, the number of rows and columns and the size of the leading dimension. A context is compatible with a specific backend if it provides all necessary attributes for a particular data type.

An advantage of selecting a BLAS-like backend is that, when later profiling the code, the user can directly attribute potential bottlenecks to individual BLAS calls. This is in contrast to template-based libraries like Eigen, where the actually called routines are not directly visible and do not correspond to a particular line within the host-language code.

A major benefit of the separation into frontend, core, and backend is that extending autoBLAS with a new backend is rather simple and in practice merely requires deriving from a class and implementing the mappings from BLAS expressions to target code. At the same time, a developer who extends autoBLAS in this way still benefits from all the symbolic optimizations implemented in the autoBLAS core.

# **6 Conclusions**

Making generic optimization (GENO) work efficiently requires several fairly different, interoperable software components. In this chapter we have described these components and their integration into the GENO software stack. By carefully designing, implementing, and integrating the components of the GENO software, we are able to generate optimization code that is competitive with problem-specific hand-written solvers and orders of magnitude faster than competing approaches that are comparably easy to use. Furthermore, the components, specifically the generic optimizer, the matrix and tensor calculus, and autoBLAS, are of independent interest and are also used in projects other than GENO.

# **References**



# **Algorithms for Big Data Problems in de Novo Genome Assembly**

Anand Srivastav, Axel Wedemeyer(B) , Christian Schielke, and Jan Schiemann

#### Kiel University, Kiel, Germany

{srivastav,wedemeyer,schielke,schiemann}@math.uni-kiel.de

**Abstract.** De novo genome assembly is a fundamental task in the life sciences. It is mostly a typical big data problem with sometimes billions of reads, a big puzzle in which the genome is hidden. Memory- and time-efficient algorithms are sought, preferably able to run even on desktop computers in labs. In this chapter we address some algorithmic problems related to genome assembly. We first present an algorithm which heavily reduces the size of the input data, with no essential compromise on the assembly quality. In this and many other algorithms in bioinformatics, the counting of k-mers is a bottleneck. We discuss counting in external memory. The construction of large parts of the genome, called contigs, can be modelled as the longest path problem or the Euler tour problem in certain graphs built on reads or k-mers. We present a linear-time streaming algorithm for constructing long paths in undirected graphs, and a streaming algorithm for the Euler tour problem with optimal one-pass complexity.

**Keywords:** De novo genome assembly · Data reduction · Euler tour · Semi-streaming longest path · External memory counting

# **1 Reduction of Input Data in Genome Assembly**

Sequencing is a chemical and physical process in which DNA is 'crushed' into very small parts ('fragments'), which are 'read' into strings called reads that contain information about the sequence of nucleotides. Reads are of limited length and contain errors (Fig. 1).

Sequencing of big genomes and other samples is a computationally challenging recent trend for two main reasons:


#### **1.1 Reads, Coverage and Assembly**

A **read** is a string over the alphabet $\Sigma = \{A, C, G, T, N\}$, where $A, C, G, T$ are the four nucleobases and $N$ is a placeholder for an unknown nucleotide. The maximal read length depends on the sequencing technology used. For the Illumina sequencing technology the read length initially was 34 and can now be as high as 300. Other sequencers allow

**Fig. 1.** A shortened paired (Illumina) read

for longer reads at the price of a higher error rate (and higher costs). Illumina produces substitution errors with an error rate of roughly 1%. In most cases, so-called paired reads are generated: in a first step, pieces of DNA of a known length are produced, which are then sequenced from both sides (e.g., a paired read with a read length of 150 contains one string over $\Sigma$ for the first 150 nucleotides and a second string over $\Sigma$ for the last 150 nucleotides).

The sequencer also outputs so-called phred scores quantifying the error probability of each nucleotide read (quality $Q = -10 \log_{10} P$, where $P$ is the error probability; e.g., $Q = 20$ corresponds to an error probability of 1%).

**Genome assembly** (or just assembly) is the task of reconstructing the complete genome of the sequenced species using the reads only (*de novo assembly*) or the reads and a reference genome (*mapping* or *reference-based assembly*). It is like a puzzle with millions of small parts, unknown overlaps, and many of the parts containing errors.

In our work we focus on de novo assembly, or just assembly for the rest of this chapter.

For a sequencing data set, the **coverage** of a genomic position $A$ is the number of reads in the data set which contain $A$. The coverage of the whole data set is the average over the coverages of all genomic positions. The empirically 'optimal' coverage for a de novo assembly is about 20 *at every position*. A coverage higher than 20 means redundant data. Some sequencing protocols, especially single-cell MDA (multiple displacement amplification), produce read sets with an extremely uneven coverage distribution. Metagenomic data sets may have an uneven coverage distribution, too, when both abundant and rare species are sequenced. Given a string $\sigma$ over the nucleotide alphabet, a $k$**–mer** is a substring of $\sigma$ of length $k$.

```
GTCTTTTATAAC
GTCTTT
 TCTTTT
  CTTTTA
   TTTTAT
    TTTATA
     TTATAA
      TATAAC
```
the 6–mers of a string

**Fig. 2.** The cost of sequencing a human genome, source: NIH

Most bacteria can't be cultivated in the lab. Therefore, it is not possible to create a homogeneous sample of thousands or millions of equal cells as in a 'normal' sequencing setting.

As a consequence, **single cell** sequencing protocols, like the multiple displacement amplification (MDA), have been developed which are able to amplify the genome of a single bacterial cell. A drawback of these methods is a strong amplification bias (called 'preferential amplification' and 'allelic dropout') between different regions of the genome, meaning that the coverage of some regions of the genome might overshoot 100,000X, while other regions are not covered at all.

A **metagenome**, introduced by [14], is, according to wiktionary, '*All the genetic material present in an environmental sample, consisting of the genomes of many individual organisms*'. In other words, in a metagenomic experiment, you are interested in


The experiment is conducted by collecting a sample from the desired environment, isolating the DNA from it and sequencing it with a Next Generation Sequencing (NGS) system. There are three different types of metagenomic experiments with different goals:

– *phylogenetic profiling:* based upon the 16S ribosomal RNA found in the sample, reconstruct which families of bacteria live in the probed environment (and how abundant they are). Basis: each (bacterial) cell has ribosomes. The coding genes for these essential proteins are widely conserved (which makes it possible to identify these genes), but they also include less conserved regions which differ between different families or even species.


In our work, we focus on de novo assembly.

The main problem of metagenome assembly is non-uniform coverage: some species in the sample are much more abundant than others. The goal is to assemble all their genomes. The following issues may arise:


# **1.2 The Bignorm Algorithm**

The basic idea of read filtering is to remove reads from a single cell or metagenome data set without losing information, and in this way to reduce the size of the problem, possibly escaping the 'big data curse'. This is possible if only those reads which have overlapping genomic regions with high coverage are removed. A good read filter should remove as many reads as possible, without lowering the coverage of the sequenced genome below the desired threshold at any position and without increasing the error rate of the data set.

Highly memory-efficient algorithms are sought to solve this problem. Brown et al. invented an algorithm named *Diginorm* [1] for read filtering that rejects or accepts reads based on the abundance of their $k$–mers. The name *Diginorm* is a short form of *digital normalization*: the goal is to normalize the coverage over all loci, using a computer algorithm after sequencing. The idea is to remove those reads from the input which mainly consist of $k$–mers that have already been observed many times in other reads. Diginorm processes reads one by one, splits them into $k$–mers, and counts these $k$–mers. In order to save RAM, Diginorm does not keep track of those numbers exactly, but instead keeps appropriate estimates using the count-min sketch (CMS) [4]. A read is accepted if the median of its $k$–mer counts is below a fixed threshold, usually 20. It was demonstrated that successful assemblies are still possible after Diginorm has removed a large amount of the data.

Diginorm is a pioneering work. However, the following points, which are important from the biological or computational point of view, are not covered by Diginorm. We have included them in our algorithm called Bignorm [29 SPP]:

(i) we incorporate the important phred quality score into the decision whether to accept or to reject a read, using a quality threshold. This allows a tuning of the filtering process towards high-quality assemblies by using different thresholds.


Let us fix the following parameters:


When our algorithm has to decide whether to accept or reject a read $i \in \mathbb{N}$, it performs the following steps: If the number of N symbols counted over all read positions is larger than $N_0$, the read is rejected. Otherwise, those parts of the read having phred scores at or above $Q_0$ are converted into a vector $H$ of *high-quality k–mers*.

Using the CMS, it is then checked how many times these $k$–mers have been seen in the accepted reads so far (via the estimated count $\widehat{c}(\mu)$ for $\mu \in H$), and two counters hold the results:

$$b_0 := |\{\mu \in H \; ; \; \widehat{c}(\mu) < c_0\}|, \qquad b_1 := |\{\mu \in H \; ; \; c_0 \le \widehat{c}(\mu) < c_1\}|.$$

Note that the frequencies are determined via CMS counters and do not consider the position *p* at which the *k*–mer is found in the read string. The read is accepted if and only if at least one of the following conditions is met:

$$b\_0 > k,\tag{1}$$

$$\sum\_{s=1}^{m(i)} b\_1 \ge B. \tag{2}$$

The motivation for condition (1) is as follows. According to [15], most errors of the Illumina sequencing platform are single substitution errors, and the probability of appearance of an erroneous $k$–mer in the genome, caused by an incorrect reading of a nucleotide, is quite low. Thus, $k$–mers produced by single substitution errors are likely to have very small counter values in the CMS (less than $c_0$) and can be considered as rare $k$–mers. One such error can affect at most $k$ $k$–mers. So if we count more than $k$ rare $k$–mers, they most likely are not the result of one single substitution error. If we assume that the probability of multiple single substitution errors in a read is smaller than the probability of error-free rare $k$–mers, we should accept this read.

Condition (2) says that the read contains enough (namely at least $B$) $k$–mers, each of which appears too frequently to be a read error (CMS counter at least $c_0$), but not so abundantly that it should be considered redundant (CMS counter less than $c_1$).

**Algorithm 1:** Bignorm
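As a simplified stand-in for Algorithm 1, the following sketch implements the accept/reject rule described above. It uses an exact dictionary instead of the count-min sketch, ignores phred scores and paired reads, and replaces condition (2) by the simpler test $b_1 \ge B$, so it is an illustration rather than the actual Bignorm implementation.

```
from collections import Counter

def canonical_kmers(read, k):
    comp = str.maketrans("ACGT", "TGCA")
    for i in range(len(read) - k + 1):
        km = read[i:i + k]
        if "N" in km:
            continue
        yield min(km, km.translate(comp)[::-1])   # canonical k-mer

def bignorm_like_filter(reads, k=20, c0=3, c1=20, B=5, N0=3):
    counts = Counter()                            # stands in for the count-min sketch
    for read in reads:
        if read.count("N") > N0:
            continue                              # too many unknown bases: reject
        H = list(canonical_kmers(read, k))        # 'high-quality' k-mers (no phred here)
        b0 = sum(1 for km in H if counts[km] < c0)
        b1 = sum(1 for km in H if c0 <= counts[km] < c1)
        if b0 > k or b1 >= B:                     # condition (1) or simplified condition (2)
            for km in H:
                counts[km] += 1                   # only accepted reads update the counts
            yield read
```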


**Results for Single-Cell Assemblies.** We tested Bignorm on 13 bacterial single-cell data sets and were able to remove up to 90% of the reads without significant loss of the assembly quality. Some results (median of all samples) (Fig. 3):


Bignorm heavily cuts away redundant reads (mean, Fig. 4, left-hand side) but is careful in critical regions (P10, Fig. 4, right-hand side).

**Fig. 3.** Reads kept

**Results for Metagenomic Assemblies.** We tested Bignorm on metagenomic data sets. For data sets with reads of length about 250 base pairs, the results are quite promising and stable. Compared to the single-cell case, the results are not as impressive. However, compared to the state-of-the-art approach of *sub-sampling* data sets which are too big to be assembled on the given hardware (meaning that a certain proportion of reads is selected randomly), we could show that read filtering makes it possible to get results which are nearly as good as those of assembling the complete data set, using about the same amount of RAM and run time as the sub-sampling approach. The following table gives an impression of the results:


**Fig. 4.** Coverage: mean and critical region

# **2 Counting** *k***–mers in External Memory (EM)**

Many bioinformatics algorithms (e.g., assemblers, error correctors, read normalization) are based on $k$–mers, and that requires counting them (mostly for $21 \le k \le 127$). As bioinformatics data sets are growing much faster than RAM sizes, new computational models are needed. (We could show that hash-based counting, which is state of the art in current software, will produce $O(n^2)$ hash table dumps when the number of different $k$–mers is much bigger than the number of slots in the hash table.)

Some examples of recent *k*–mer counting algorithms are:


We need some notations:




# **2.1 Counting in RAM**

**The straightforward algorithm.** For small values of $k$, counting can be done in RAM. If $m \le R/B$, the following $O(n+m)$ algorithm can be used:


For $k = 19$ and one byte per counter, $4^{19}$ bytes $\approx 275\,$GB of RAM are needed.
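As a sketch of such a straightforward in-RAM counter (our own illustration, assuming each $k$–mer is encoded with two bits per base and used as an index into an array of $4^k$ one-byte counters), the memory blow-up for larger $k$ is immediately visible:

```
import numpy as np

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def count_kmers_in_ram(reads, k):
    counts = np.zeros(4 ** k, dtype=np.uint8)   # one byte per counter, 4^k slots
    mask = 4 ** k - 1                           # keeps the last k bases (2k bits)
    for read in reads:
        code, valid = 0, 0
        for base in read:
            if base not in CODE:                # an 'N' resets the current window
                code, valid = 0, 0
                continue
            code = ((code << 2) | CODE[base]) & mask
            valid += 1
            if valid >= k:
                counts[code] += 1               # saturating counters in real code; simplified here
    return counts

print(count_kmers_in_ram(["GTCTTTTATAAC"], 6).sum())   # 7 k-mers, as in Fig. above
```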

**Hash–Based Counting.** Most state–of–the–art *k*–mer counting programs are based on hash algorithms using open addressing:


**Theorem 1 (Gallus, Srivastav, Wedemeyer 2021).** *When counting a set of n elements of a population with K different, normally distributed types using a hash table of size h, the expected number of hash table dumps $d(h,n)$ is*

$$\mathbb{E}[d(h,n)] = n \log\_{(1-\frac{h}{c\_n})} \left(1 - \frac{1}{c\_n}\right),\tag{3}$$

*where $c_n = K\bigl(1 - \bigl(1 - \frac{1}{K}\bigr)^{n}\bigr)$ gives the number of different (normally distributed) types in a set of size n.*

This formula is the basis for further quantifying the log-term in (3). If one can show that this log-term behaves linearly or sublinearly in $n$ when singletons are included in the set of $k$-mers, it would match experimental observations. In fact, a constant portion of the $k$–mers can be assumed to be sequencing errors, each of which occurs exactly once.

# **2.2 Counting in External Memory**

**kmc3** is the presently leading program using the external memory model. It works as follows:


Drawback of kmc3: the output files of kmc3 are not completely sorted (due to the introduction of $(k,x)$-mers in kmc2). Therefore,


As a result, even though kmc3 is the fastest EM *k*–mer counter available (and the fastest *k*–mer counter overall under RAM restriction), it is not the perfect choice to be used as a counting module for an EM assembler.

Based on STXXL 1.4.1 [5], in 2018 Christopher Nehls [20] from Kiel University developed a *k*–mer counter called **xsc** which uses a sorting based approach:


For $k \le 32$, xsc outperformed jellyfish and was at least competitive with dsk, but kmc3 was always faster. For $k \ge 33$ (using uint128 and uint256 classes), xsc was not competitive with the existing counters. The main bottleneck of xsc is the overloaded relational operator (operator<).

# **2.3 Counting Using a Bloomfilter**

Roy et al. [27] stated that *more than 50% of all k–mers in a sequencing data set may be singletons*, which are not of interest as they were probably introduced by errors. To exploit this, their $k$–mer counter Turtle uses an upstream Bloom filter to save space and time in a sorting-based approach named 'sort-and-compact'.

We developed a program which combines the ideas of kmc with the usage of a Bloom filter. Experiments show that the cost of running the Bloom filter is higher than the savings (Fig. 5). What is wrong? Say we have 100 distinct $k$–mers, where the $i$-th $k$–mer occurs $i$ times.


**Fig. 5.** Comparison of run times using or not using a bloomfilter

Our input then contains 5050 $k$–mer occurrences, and the Bloom filter absorbs only the 100 first occurrences, i.e., roughly 2% of the input, which is not enough to compensate for the running time of the Bloom filter.

**Our Current Approach.** We have developed the following algorithm which combines sorting and kmc. Experiments are ongoing work:

```
Input:  fastq files as produced by an (Illumina) sequencer
Result: k-mers of the input and their number of occurrences in the input
begin
    foreach canonical k-mer κ in the input do
        split κ into prefix and suffix
        use Turtle-like sort-and-compact per prefix
        if an array is full then
            dump the sorted array to EM
    merge arrays
```
# **3 A Streaming Algorithm for the Longest Path Problem**

In de novo genome assembly, finding large genome sequences called contigs is the fundamental problem. It can be understood as computing a very long path in an associated graph, for example the de Bruijn graph ([3]). Unfortunately, computing the longest path in a graph is an NP-hard problem, and the situation is even worse if the graph is very large. In this chapter, we present a new algorithm for computing a long path, which is surprisingly competitive with RAM-based algorithms.

Graph streaming is a very efficient concept to handle big graphs, where the number of edges is far too large for computations in the main memory. The semi-streaming model was introduced by Feigenbaum et al. [11], and can be briefly described as follows:

In the semi-streaming model, the algorithm is allowed to use at most $O(n \cdot \operatorname{polylog}(n))$ bits of RAM, where $n$ is the number of vertices of the input graph. Because of this restriction, dense graphs, where the number of edges is of order $\omega(n \cdot \operatorname{polylog}(n))$, cannot be processed entirely in RAM. Instead, the edges are presented in a stream, in no particular order. Typically, it is desired to make only a small number of passes (over the input stream).

# **3.1 Our Tree-Based Algorithm**

We give a streaming algorithm for the longest path problem in undirected graphs with a proven per-edge processing time of $O(n)$, published in the proceedings of the European Symposium on Algorithms 2016 [16 SPP]. Our algorithm works in two phases, which we outline briefly in the following. In the first phase, global information on the graph is gathered in the form of a constant number of spanning trees $T_1, \ldots, T_\tau$. This is possible in the streaming model since, roughly speaking, for a spanning tree we can "take edges as they come". A spanning tree can be constructed in just one pass; we however use multiple passes and limit the maximum degree during the first passes in order to favor path-like structures and avoid clusters of edges. Experiments clearly indicate that this degree-limiting is essential for solution quality. The spanning trees fit into RAM, since we consider $\tau$ to be constant (we will in fact have $\tau = 1$ or $\tau = 2$ in the experiments). After construction of the $\tau$ trees, they are merged into one graph $U$ by taking the union of their edges. Then we use standard algorithms to determine a long path $P$ in $U$, isolate $P$, and finally add enough edges around $P$ to obtain a tree $T$.

Then, in the second phase, we conduct further passes during which we test whether the exchange of single edges of $T$ can improve the longest path in it. (A longest path in a tree can be found by conducting DFS two times [2]; the length of a longest path in a tree is its diameter.) The main challenge in the second phase is to quickly determine which edges should be exchanged. We show that this decision can be made in linear time, hence yielding a per-edge processing time of $O(n)$.
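The tree step can be illustrated as follows (our own sketch, with the tree given as an adjacency list; breadth-first search is used here, which serves the same purpose as the DFS mentioned above): the first search finds one endpoint of a longest path, the second recovers the path itself.

```
from collections import deque

def farthest(tree, start):
    # BFS returning the farthest vertex from `start` plus predecessor pointers
    pred = {start: None}
    queue, last = deque([start]), start
    while queue:
        v = queue.popleft()
        last = v                      # last dequeued vertex has maximum distance
        for w in tree[v]:
            if w not in pred:
                pred[w] = v
                queue.append(w)
    return last, pred

def longest_path_in_tree(tree):
    u, _ = farthest(tree, next(iter(tree)))   # first pass: one end of a longest path
    v, pred = farthest(tree, u)               # second pass: the other end
    path = [v]
    while pred[path[-1]] is not None:         # walk the predecessor pointers back to u
        path.append(pred[path[-1]])
    return path[::-1]

# tiny example: a path 0-1-2 with an extra leaf 3 attached to node 1
print(longest_path_in_tree({0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}))   # [3, 1, 2]
```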

For a set $X$, we write $x =_{\mathrm{unif}} X$ to express that $x$ is drawn uniformly at random from $X$.

An example run of the Algorithm is shown in Fig. 6.

## **3.2 Linear Complexity of the Streaming Algorithm**

If the cycle $C$ is of length $\Omega(n)$, then a naive implementation requires $\Omega(n^2)$ time to find an edge $e$ to remove (temporarily remove each edge on the cycle and invoke the Dijkstra algorithm). However, we have:

**Theorem 2 (Kliemann, Schielke, Srivastav 2016).** *Phase 2 can be implemented with per-edge processing time O* (*n*)*.*

**Algorithm 3:** Streaming Phase 1: Spanning Tree Construction

**Input:** connected graph $G = (V,E)$ as a stream of edges, parameter $\tau$, degree limit sequence $D = (D_1, \ldots, D_{q_1})$
**Output:** spanning tree of $G$
**1** **foreach** $i = 1, \ldots, \tau$ **do**
**2**   $T_i := (V, \emptyset)$
**3**   SpanningTree($T_i$)
**4** $U := (V, \bigcup_{i=1}^{\tau} E(T_i))$
**5** find a long path $P$ in $U$ using Warnsdorf's algorithm
**6** $T := (V, E(P))$
**7** SpanningTree($T$)
**8** **return** $T$

#### **Procedure** SpanningTree(T)

**Input:** forest $T$ on $V$, possibly empty
**Output:** spanning tree on $V$
$r =_{\mathrm{unif}} [m]$; fast-forward the stream to position $r$
**for** $p = 1, \ldots, q_1$ **do**
  **while** *not at the end of the stream* **do**
    get next edge $vw$ from the stream
    **if** $T + vw$ *is cycle-free and* $\max\{\deg_T(v), \deg_T(w)\} < D_p$ **then** $T := T + vw$
    **if** $|T| = n - 1$ **then** break
  rewind the stream to its beginning

#### **Algorithm 4:** Streaming Phase 2: Improvement

**Input:** connected graph $G$ as a stream of edges, spanning tree $T$, pass limit $q_2$
**Output:** a (long) path in $G$
**1** compute a longest path $P$ in $T$ with the Dijkstra algorithm
**2** **for** $q_2$ *times* **do**
**3**   rewind the stream to its beginning
**4**   **while** *not at the end of the stream* **do**
**5**     get next edge $e = vw$ from the stream
**6**     **if** $v \in V(P)$ *and* $w \in V(P)$ **then** discard and continue with the next iteration
**7**     $T := T + e$
**8**     compute the fundamental cycle $C$ in $T$
**9**     $\ell^* := \max_{f \in E(C) \setminus \{e\}} \ell(T - f)$
**10**    **if** $\ell^* > |P|$ **then**
**11**      pick any $f$ from the set $\{f \in E(C) \setminus \{e\} : \ell(T - f) = \ell^*\}$
**12**      $T := T - f$
**13**      update $P$ with a longest path in $T$
**14**    **else** $T := T - e$
**15** **return** $P$

**Fig. 6.** Example run of the algorithm's steps.

*Proof.* An *O* (*n*) bound is clear for all lines of Algorithm 4, except Line 9 and Line 11. Denote

$$\ell' := \max_{f \in E(C)} \max \{ |P'| : P' \text{ is a path in } T' - f \text{ and } e \in E(P') \},$$

where $T' := T + e$ denotes the tree after inserting $e$, and let $R \subseteq E(C) \setminus \{e\}$ be the set of edges where this maximum is attained. Then the following implications hold: $\ell' \le |P| \implies \ell^* \le |P|$ and $\ell' > |P| \implies \ell' = \ell^*$. This is because if a longest path in $T' - f$ is supposed to be longer than $P$, it must use $e$ (since otherwise it would be a path in $T$, and $P$ is a longest path in $T$). Hence it suffices to determine $\ell'$, and if $\ell' > |P|$, to find an element of $R$.

Denote by $C = (v_1, \ldots, v_k)$ the fundamental cycle, for some $k \in \mathbb{N}$, written so that $e = v_1 v_k$. When computing $\ell'$, we can restrict attention to paths in $T'$ of the form

$$\left(\ldots, v_s, v_{s-1}, \ldots, v_1, v_k, v_{k-1}, \ldots, v_t, \ldots\right)\tag{4}$$

for $1 \le s < t \le k$, where $v_s$ is the first and $v_t$ is the last common vertex, respectively, of the path and $C$. For each $i$, let $T_i$ be the connected component of $v_i$ in $T - E(C)$, i.e., $T_i$ is the part of $T$ that is reachable from $v_i$ without using the edges of $C$. Denote by $\ell(T_i)$ the length of a longest path in $T_i$ that starts at $v_i$, and set $c_i := \ell(T_i) + i - 1$ and $a_i := \ell(T_i) + k - i$. Then a longest path entering $C$ at $v_s$ and leaving it at $v_t$, as in (4), has length exactly $c_s + a_t$. Hence we have to determine a pair $(s,t)$ such that $c_s + a_t$ is maximum (this maximum value is $\ell'$); we call such a pair an *optimal pair*. If the so-determined value $\ell'$ is not greater than $|P|$, then nothing further has to be done (the edge $e$ cannot give an improvement). Otherwise, having constructed our optimal pair $(s,t)$, we pick an arbitrary edge (e.g., uniformly at random) from $\{v_i v_{i+1} : s \le i < t\}$, which are the edges between $v_s$ and $v_t$ on $C$. We show that the following algorithm computes the value $\ell'$ and an optimal pair in $O(n)$.

```
1  compute c_1,...,c_{k-1} and a_2,...,a_k using DFS
2  M := 0; L := 0
3  for i = 1,...,k-1 do
4      if c_i > M then
5          M := c_i
6          s := i
7      if M + a_{i+1} > L then
8          L := M + a_{i+1}
9          s* := s
10         t := i+1
11 return (s*, t)
```
The total of the computations in Line 1 can be done by DFS in $O(n)$, and the loop takes $O(k) \le O(n)$. We prove that the final pair $(s^*, t)$ is optimal. For fixed $t$, the best possible length $c_s + a_t$ is obtained if $t$ is combined with an $s < t$ where $c_s \ge c_j$ for all $j < t$. In the algorithm, for each $t$ (when $t = i+1$ in the loop) we combine $a_t$ with the maximum $\max_{j<t} c_j$ (stored in the variable $M$, whose index is stored in $s$). Thus, when the algorithm terminates, $L = \ell'$ and $c_{s^*} + a_t = \ell'$.

**Corollary 1.** *Our streaming algorithm (with the two phases as in Algorithm 3 and Algorithm 4) can be implemented with a per-edge processing time of O* (*n*)*.*

We turn to the memory requirement. Denote by *b* the amount of RAM required to store one vertex or one pointer (e.g., *b* = 32*bit* or *b* = 64*bit*) and call *n* · *b* one *unit*.

**Theorem 3.** *Our streaming algorithm (with the two phases as in Algorithm 3 and Algorithm 4) conducts at most* 2*q*<sup>1</sup> +*q*<sup>2</sup> *passes. Moreover, the algorithm can be implemented such that the RAM requirement is at most* (max{4τ, 2τ+4}·*n*+*c*)·*b with a constant c.*

The proof can be found in [16 SPP].

An **experimental study** was conducted on randomly generated instances with different structure, including ones created with the generator for hyperbolic geometric random graphs [18 SPP]. Different variants of our streaming algorithm are compared with four RAM algorithms: Warnsdorf and Pohl-Warnsdorf (two related classical heuristics [23,24]), Pongrácz (a recently published heuristic [25]), and a simple randomized DFS. Experiments show that although we never do more than 11 passes, results delivered by our algorithm are competitive. We deliver at least 71% of the best result delivered by any of the tested RAM algorithms, with the exception of preferential attachment graphs. By considering low percentiles, we observe a similar quality without any restriction on the graph class. This is a good result also in absolute terms, since we observe that for each graph class and set of parameters, there is one algorithm that on average gives a path of length 0.84 · *n*, i.e., 84% of a Hamilton path. On some graph classes, we outperform any of the tested RAM algorithms, which makes our algorithm interesting even outside of the streaming setting.

# **4 A One-Pass Streaming Algorithm for Computing the Euler Tour in Graphs**

Large genome sequences (contigs) can be computed in de novo genome assembly with so-called de Bruijn graphs on k-mers ([3,22]). Such graphs are directed. For very large graphs, the computation of an Euler tour cannot be done with known RAM-based algorithms, and techniques like semi-streaming or external memory algorithms are sought. In this chapter, we present a survey of our optimal one-pass streaming algorithm for computing an Euler tour in an undirected graph. Our algorithm might be helpful for designing a semi-streaming algorithm to compute Euler tours in a directed graph, which is an open problem.

Let $G$ be a graph on $n$ nodes and $m$ edges given in the form of a data stream. We study the problem of finding an Euler tour in $G$. We present a survey of the first one-pass streaming algorithm computing an Euler tour of $G$ in the form of an edge successor function with only $O(n \log(n))$ RAM, based on our paper [13 SPP]. The memory requirement is optimal for this setting according to Sun and Woodruff [28].

## **4.1 The W-Streaming Model and a Lower Bound**

The *W-streaming model* was introduced by Demetrescu et al. [7]. It is a relaxation of the classical streaming model. At each pass, an output stream is written, which becomes the input stream of the next pass. For an Euler tour the successor of each edge in the tour is uniquely defined by its successor function, say δ. Then the output stream has the following form, where the edges are unordered.

... *e* δ(*e*) δ(δ(*e*)) ...

Finding an Euler tour in trees in W-streaming has been studied in multiple papers (e.g., [6]), but the general Euler tour problem has hardly been considered in a streaming model. There are some general results for transferring PRAM algorithms to the W-streaming model. In general, lower bounds for the complexity of streaming algorithms are hard to prove. Interestingly, Sun and Woodruff [28] showed that even a one-pass streaming algorithm for verifying whether a graph is Eulerian needs $\Omega(n \log(n))$ RAM, and this amount of RAM is also required for a one-pass streaming algorithm for finding an Euler tour.

#### **4.2 The Problem of Cycle Merging**

The Euler tour problem in the RAM model can be easily solved by computing edge-disjoint cycles and merging them. We will see why this is a problem with limited RAM. A *cycle* is a closed walk on the edges of $G$ such that every node is visited at most once. The following result is well known in graph theory.

**Theorem 4.** *If a graph with $m$ edges contains an Euler tour, it can be decomposed into at most $\frac{m}{3}$ pairwise edge-disjoint cycles.*

In fact, this can be accomplished in one pass.

**Theorem 5.** *During the pass, the edges from the input-stream can be ordered in form of a sequence of edge-disjoint cycles.*

*Proof.* 1. Start with $T := \emptyset$.


If $T \ne \emptyset$ at the end, there are some nodes of odd degree, thus $G$ does not contain an Euler tour.
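One plausible way to realize this single pass (our own sketch, not the exact procedure from [13 SPP]) is to keep the not-yet-emitted edges as a forest: whenever a new edge closes a cycle, the tree path between its endpoints together with that edge is output as a cycle and removed again.

```
def cycles_from_stream(edges, n):
    adj = {v: set() for v in range(n)}      # forest T of yet-unused edges

    def tree_path(u, v):
        # DFS in the forest from u towards v; returns the path or None
        stack, pred = [u], {u: None}
        while stack:
            x = stack.pop()
            if x == v:
                path = [v]
                while pred[path[-1]] is not None:
                    path.append(pred[path[-1]])
                return path[::-1]
            for y in adj[x]:
                if y not in pred:
                    pred[y] = x
                    stack.append(y)
        return None

    for u, v in edges:                      # one pass over the edge stream
        path = tree_path(u, v)
        if path is None:
            adj[u].add(v); adj[v].add(u)    # edge joins two trees of the forest
        else:
            yield path                      # tree path plus the edge (u, v) is a cycle
            for a, b in zip(path, path[1:]):
                adj[a].discard(b); adj[b].discard(a)   # remove cycle edges from T

# example: two triangles sharing node 0
print(list(cycles_from_stream([(0, 1), (1, 2), (2, 0), (0, 3), (3, 4), (4, 0)], 5)))
```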

Obviously and unfortunately, we cannot store all the cycles in the semi-streaming model. The challenge is to merge cycles as they appear, while respecting the memory limitation of $O(n \log(n))$. We will use the notion of tours or subtours for cycles, too.

The merging of two tours at one node is easy. We just flip edges in a canonical way and get the new tour:

Similarly, one can merge several tours at one common node.
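Concretely, if tours are stored via their successor functions and two tours both visit a node, swapping the successors of the two edges entering that node merges them into one tour. The following toy sketch (our own notation, with the successor function as a Python dict keyed by directed edges) illustrates the swap:

```
def merge_at_node(succ, e1, e2):
    """Merge two edge-disjoint tours by swapping successors.

    e1 and e2 are edges of the two tours that both end in the same node v;
    after the swap, each tour continues with the other tour's outgoing edge.
    """
    succ[e1], succ[e2] = succ[e2], succ[e1]
    return succ

# two triangles sharing node 0, each given as a cyclic successor function
succ = {(0, 1): (1, 2), (1, 2): (2, 0), (2, 0): (0, 1),
        (0, 3): (3, 4), (3, 4): (4, 0), (4, 0): (0, 3)}
merge_at_node(succ, (2, 0), (4, 0))   # both edges end in node 0

# following succ from (0, 1) now visits all six edges
e, tour = (0, 1), []
for _ in range(len(succ)):
    tour.append(e)
    e = succ[e]
print(tour)
```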

The problematic case is the simultaneous merging at two nodes. Here is an example.

Unfortunately, the result of this merging is two tours, and the merging failed. A problem only occurs if the cycle shares more than one node with an already existing tour. In this case, we have to make sure that edge-swapping is performed at exactly one of these nodes. Every node belongs to at most one tour at a time, thus all nodes of a tour can get the same label.

## **4.3 The W-Streaming Algorithm and Its Analysis**

We proceed to the pseudo-code statement of our streaming algorithm.

**Algorithm 5:** EULER-TOUR

**input:** undirected graph $G = (V,E)$, edge by edge on a stream $S$
**output:** Euler tour for $G$, i.e., a *successor function* $\delta^*$, if there is one
**1** $c := 0$; $F := \emptyset$; $E_{\mathrm{int}} := \emptyset$; for every $v \in V$: $s(v) := 0$, $t(v) := 0$
**2** **for** *every edge $e$ on $S$* **do**
**3**   $E_{\mathrm{int}} := E_{\mathrm{int}} \cup \{e\}$
**4**   **if** $G_{\mathrm{int}} = (V, E_{\mathrm{int}})$ *contains a cycle $C$* **then**
**5**     MERGE-CYCLE($C$)
**6** **if** $E_{\mathrm{int}} \ne \emptyset$ **then**
**7**   ERROR: at least one node with odd degree exists
**8** **if** *there exist $u$, $v$ with $t(u) = t(v) = 0$* **then**
**9**   ERROR: graph is not connected

**<sup>10</sup>** WRITE-F

**Procedure** Merge-Cycle

**input:** ordered directed cycle $C = (v_1, \ldots, v_k)$ of length $k$


The output stream is a successor function, i.e. *e*1, δ(*e*1), *e*2, δ(*e*2), ... For *a*,*b*,*c* ∈ *V* with (*a*,*b*),(*b*,*c*) ∈ *E*, the triple (*a*,*b*,*c*) represents the assignment δ((*a*,*b*)) = (*b*,*c*), i.e., edge (*b*,*c*) is the successor of edge (*a*,*b*). The output stream is *not* necessarily an ordered trail!

The main result is the following theorem [13 SPP].

**Theorem 6 (Glazik, Schiemann, Srivastav, 2017).** *There exists a one-pass W-streaming algorithm with* O(*n*log*n*) *RAM that outputs an Euler tour of the input graph G (if G contains an Euler tour).*

We sketch the proof. Let δ be a successor function. For *e* ∈ *E*, define the equivalence class [*e*]δ = { *f* ∈ *E* : *e* ≡δ *f* }, where *e* ≡δ *f* if and only if δ^*k*(*e*) = *f* for some *k* ∈ ℕ. We identify the successor function with the induced equivalence classes on *E*.

**Lemma 1 (Algebraic Representation** [13 SPP]**, Lemma 1).** *Let* δ *be a bijective successor function on a directed graph G* = (*V*, *E*)*. Then* ≡δ *is an equivalence relation on E.*

**Lemma 2.** *Let G* = (*V*, *E*) *be a directed graph with bijective successor function* δ *and the related equivalence relation* ≡δ*. Then we have:*

*(i) Let e* ∈ *E and k*1, *k*2 ∈ ℕ *with k*1 ≠ *k*2 *and* δ^*k*1(*e*) = δ^*k*2(*e*)*. Then* |*k*1 − *k*2| ≥ |[*e*]δ|*.*
*(ii) For any e* ∈ *E we have* δ^|[*e*]δ|(*e*) = *e.*

*Proof.* (i): Define *F*s : *E* → *E* for *s* ∈ ℕ by *F*s(*e*′) = δ^(*s*(*k*1−*k*2))(*e*′). Then δ^*k*2(*e*) is a fixed point of *F*s. Let *M* := {δ^ℓ(*e*) : *k*2 ≤ ℓ < *k*1}, so |*M*| ≤ *k*1 − *k*2, and [*e*]δ ⊆ *M* by the fixed-point property of *F*s. The assumption *k*1 − *k*2 < |[*e*]δ| would imply |*M*| < |[*e*]δ| ≤ |*M*|, a contradiction.

(ii): Let *e*0 ∈ *E*, set *r* := |[*e*0]δ|, and assume for a moment that δ^*r*(*e*0) ≠ *e*0. Let *M* := {δ^ℓ(*e*0) : 1 ≤ ℓ ≤ *r*} ⊆ [*e*0]δ.

Case 1: *e*0 ∈ *M*. Then

$$
\delta^0(e_0) = e_0 = \delta^\ell(e_0) \text{ for some } \ell < r.
$$

By (i), ℓ − 0 ≥ *r*, a contradiction.

Case 2: *e*0 ∉ *M*. Then |*M*| < |[*e*0]δ|. By the pigeonhole principle, there exist 1 ≤ *k*1 < *k*2 ≤ *r* with δ^*k*1(*e*0) = δ^*k*2(*e*0) and *k*2 − *k*1 < *r*, in contradiction to (i).

Furthermore, a structural theorem is needed. For an edge *e* = (*v*,*w*), let *e*(1) := *v* and *e*(2) := *w*.

**Theorem 7 (Successor function generates Euler tour** [13 SPP]**, Theorem 3).** *Let G* = (*V*,*E*) *be a directed graph with bijective successor function* δ *such that e* ≡δ *e*′ *for all e*,*e*′ ∈ *E. Then* δ *is the successor function of an Euler tour for G.*

Let δ0 be the successor function of an edge-disjoint cycle decomposition of *G*. The algorithm computes a sequence of successor functions δ∗0 = δ0, δ∗1, ..., δ∗*N* =: δ∗.

**Theorem 8.** *If G is Eulerian,* δ∗ *determines an Euler tour on G.*

The following lemma is the backbone of the proof and requires substantial work.

**Lemma 3 (**[13 SPP]**, Lemma 9).** *Let k* ∈ {0,...,*N*}*. Then* δ∗*k is bijective and for any* (*u*,*v*),(*u*′,*v*′) ∈ *R*∗(*E*)*, we have*


*Proof (Proof of Theorem* 8*).* We show: if δ∗ is bijective and *e* ≡δ∗ *e*′ for all *e*,*e*′ ∈ *E*, then δ∗ is the successor function of an Euler tour by Theorem 7. By Lemma 3, δ∗ = δ∗*N* is bijective. For the second property, let *e*,*e*′ ∈ *E* with *e* = (*u*,*v*) and *e*′ = (*u*′,*v*′). We show *e* ≡δ∗ *e*′. Since *G* is Eulerian, there exists a *u*–*u*′-path *P* in *G*. Let *P* = *u x*1 *x*2 ...*xk u*′ be such a path.

By Lemma 3 (ii), label *tN* propagates through *P*:

$$\begin{aligned} t_N(u) &= t_N(x_1) = t_N(x_2) = \dots = t_N(x_k) = t_N(u')\\ \Rightarrow\quad e &\equiv_{\delta_N^*} e'. \end{aligned}$$

In future work, we may investigate other routing problems and applications of streaming algorithms using Euler tours.

# **References**

17. Kokot, M., Dlugosz, M., Deorowicz, S.: KMC 3: counting and manipulating k-mer statistics. Bioinformatics **33**(17), 2759–2761 (2017). https://doi.org/10.1093/bioinformatics/btx304
19. Marçais, G., Kingsford, C.: A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics **27**(6), 764–770 (2011). https://doi.org/10.1093/bioinformatics/btr011
20. Nehls, C.: Effizientes sortier-basiertes Zählen von k-meren im externen Speicher. Master's thesis, Mathematisches Seminar, Universität zu Kiel (2018)
21. Pandey, P., Bender, M.A., Johnson, R., Patro, R.: Squeakr: an exact and approximate k-mer counting system. Bioinformatics **34**(4), 568–575 (2018). https://doi.org/10.1093/bioinformatics/btx636
22. Pevzner, P.A., Tang, H., Waterman, M.S.: A new approach to fragment assembly in DNA sequencing. In: RECOMB, pp. 256–267. ACM (2001). https://doi.org/10.1145/369133.369230
23. Pohl, I.: A method for finding Hamilton paths and knight's tours. Commun. ACM **10**(7), 446–449 (1967). https://doi.org/10.1145/363427.363463
24. Pohl, I., Stockmeyer, L.: Pohl-Warnsdorf revisited. In: Proceedings of the ISC 2004 (2004). https://users.soe.ucsc.edu/~pohl/Papers/Pohl_Stockmeyer_full.pdf
25. Pongrácz, L.L.: A greedy approximation algorithm for the longest path problem in undirected graphs. CoRR abs/1209.2503 (2012). Withdrawn
26. Rizk, G., Lavenier, D., Chikhi, R.: DSK: k-mer counting with very low memory usage. Bioinformatics **29**(5), 652–653 (2013). https://doi.org/10.1093/bioinformatics/btt020
27. Roy, R.S., Bhattacharya, D., Schliep, A.: Turtle: identifying frequent k-mers with cache-efficient algorithms. Bioinformatics **30**(14), 1950–1957 (2014). https://doi.org/10.1093/bioinformatics/btu132
28. Sun, X., Woodruff, D.P.: Tight bounds for graph problems in insertion streams. In: APPROX-RANDOM, pp. 435–448. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2015). https://doi.org/10.4230/LIPIcs.APPROX-RANDOM.2015.435
30. Wölfel, P.: Über die Komplexität der Multiplikation in eingeschränkten Branchingprogrammmodellen. Ph.D. thesis, Technical University of Dortmund, Germany (2003). http://hdl.handle.net/2003/2539


# **Scalable Text Index Construction**

Timo Bingmann1, Patrick Dinklage2, Johannes Fischer2, Florian Kurpicz2(B) , Enno Ohlebusch3, and Peter Sanders1

<sup>1</sup> Karlsruher Institut für Technologie, Karlsruhe, Germany
tb@panthema.net, sanders@kit.edu
<sup>2</sup> Technische Universität Dortmund, Dortmund, Germany
{patrick.dinklage,florian.kurpicz}@tu-dortmund.de, johannes.fischer@cs.tu-dortmund.de
<sup>3</sup> Universität Ulm, Ulm, Germany
enno.ohlebusch@uni-ulm.de

**Abstract.** We survey recent advances in scalable text index construction with a focus on practical algorithms in distributed, shared, and external memory.

**Keywords:** Text indices · Suffix array · Suffix tree · Wavelet tree · Burrows-Wheeler transform · FM-index · Distributed memory · Shared memory · External memory

# **1 Introduction**

Texts occur in many different domains, ranging from natural language texts over source code to DNA and protein sequences, and their amount is ever-increasing. The field of algorithm and data structure research on strings is often referred to as *Stringology*. One important aspect within this line of research is the efficient construction of text indices. A text index is a data structure that provides additional information for a given text to speed up answering different types of queries, e.g., pattern matching queries that ask if (or how often, or where) a pattern occurs in the text. We focus on full-text indices for possibly unstructured texts, which allow the user to query for arbitrary patterns (this excludes, e.g., inverted indices). Real-world applications of text indices can be found, for example, in computational biology where text indices are a crucial part of the software for DNA alignment [134]. However, the amount of textual data is increasing significantly faster than the computational capacity of ordinary computers. For example, in 2008 the 1000 Genomes Project (1KGP) was launched to collect and sequence the genomes of thousands of people, whereas, in 2020, the 1+Million Genomes Initiative (1+MG) started to collect at least one million genomes, making this collection 1000 times larger. Therefore, scalable construction algorithms that can handle the massively growing amount of text are necessary.

In this survey, we discuss the current state of the art in scalable text index construction. We focus on distributed, external, and shared memory construction algorithms for different text indices and their applications. While there already exist surveys focussing on particular indices (e.g., suffix arrays [28,172] or wavelet trees [149,160,63 SPP]), or models of computation (e.g., external memory [23,56]), this chapter tries to give

**Fig. 1.** Relations of text indices. In this article, we consider text indices that have scalable construction algorithms. The labels SM, DM, EM mark whether such construction algorithms in shared, distributed, and external memory exist. Note that the LZ77 factorization itself is a text compression and not an index. We use arrows to denote indices that are used (in practice) to compute the targeted text index or that are a special case of the targeted index. Diamonds are used to denote indices that are part of the targeted text index.

a more unified view. To this end, we point out common techniques that are used in different models of computation or in the computation of different text indices.

This survey is structured as follows. First, in Sect. 2, we introduce models of computation and give an overview of (string) sorting algorithms and further building blocks that are required as basic tools for text index construction. The main body of work can be found in Sect. 3. Here, we discuss the scalable construction of different text indices. We start with the suffix array (*SA*), one of the most well-researched text indices, and the longest common prefix (*LCP*) array, which often accompanies the *SA*. Next, we take a look at wavelet trees (*W T*) and the Burrows-Wheeler transform (*BW T*), which both are important parts of the FM-index, a compressed text index frequently used in practice. Then, we discuss algorithms for the suffix tree (*ST*) and space-efficient representations thereof. See Fig. 1 for an overview of the text indices and their relations. While most of the discussed work solely focuses on the construction of the text indices, we also show approaches to *answer* queries on text indices in distributed memory. Finally, in Sect. 4, we show real-world applications of text indices in bioinformatics and text compression before we address future challenges in Sect. 5.

# **2 Preliminaries**

Let *T* = *T*[0]...*T*[*n* − 2]\$ be a text of length *n* over an alphabet Σ = [0,σ), where we assume that *T* is terminated with an end-of-file or sentinel symbol \$ with \$ ∉ Σ and \$ < α for all α ∈ Σ. A text over an alphabet of size σ = 2 is called a *bit vector*. Usually, bit vectors do not contain a sentinel. We call *T*[*i*.. *j*) = *T*[*i*]...*T*[*j*−1] a *substring* of *T* for *i*, *j* ∈ [0,*n*]. The substrings *T*[0..*i*) and *T*[*j*..*n*) are called *prefix* and *suffix*, respectively, for *i*, *j* ∈ [0,*n*].

## **2.1 Models of Computation**

In this section, we introduce models of computation that are relevant for the rest of this chapter and give pointers to software libraries that are commonly used to implement algorithms in those models. The starting point is the sequential *random access machine* (RAM) model [182], where we have a single *processing element* (PE) that contains multiple registers to perform operations on data and a main memory, which can be accessed in constant time. However, real-world systems are often more complex and require more sophisticated models.

One of these models is the *external memory* (EM) model [4]. Here, we have an internal memory of size *M* words and an external memory of unlimited size that is much slower to access randomly. To compensate for this, transfers between EM and RAM happen in blocks of *B* consecutive words. Such a transfer is called *I/O operation* (I/O for short). The cost of external memory algorithms is then described by the number of required I/Os, e.g., scanning through *N* elements requires Θ(*N*/*B*) I/Os, and sorting *N* elements requires sort(*N*) := Θ((*N*/*B*) log_{*M*/*B*}(*N*/*B*)) I/Os. The software libraries STXXL [57] and TPIE [9] implement the most commonly used external memory algorithms and data structures. A (practical) relaxation of the model is the *semi-external* model, where we allow random access to either the input or output, but not both. The Succinct Data Structure Library (SDSL) [94] provides implementations of semi-external construction algorithms for various data structures.

We also consider two parallel machine models, where by *p* we always denote the number of available PEs. The first is the *parallel random access machine* (PRAM), where all PEs have access to the same (shared) memory. There are various PRAM variants differentiating between which types of concurrent memory reads/writes are allowed; for practical algorithms on a multi-core processor one should only use exclusive writes, implying that the *Concurrent Read Exclusive Write* (CREW) model is best for analyzing algorithms. In the analysis, the *work* and *depth* are of interest. The former is the total number of operations performed, and the latter is the longest sequence of sequential dependencies in the algorithm. When implementing shared memory algorithms, Cilk [38] (now deprecated), OpenMP [53], Intel's TBB [174], Microsoft's Parallel Patterns Library (PPL), or built-in concurrency features of the programming language, e.g., thread in C++11, are often used to express parallelism. The Multi-Core Standard Template Library (MCSTL) [188] provides parallel algorithms and can be used as the *parallel mode* of the GNU C++ Standard Library. Recently, ParlayLib [36] was introduced as a library containing efficient implementations of the parallel algorithms in the C++ Standard Library.

The *distributed memory* model is our second parallel machine model. Here, communication between different PEs is conducted by sending messages over a network, and PEs have only local memory. Often, the cost of such a message is given as a startup cost plus a cost that depends on the size of the message. This is also reflected in the *bulk-synchronous parallel* model [200], where algorithms are divided into a sequence of supersteps consisting of three phases: local work, communication, and synchronization. The cost of an algorithm is then the sum of the costs of all supersteps. In practice, there are two flavors of frameworks for developing distributed algorithms: lowlevel interfaces provided by the *message passing interface* (MPI)<sup>1</sup> with its open-source implementations Open MPI [89] and MPICH [98], and frameworks providing a more

<sup>1</sup> MPI standard: https://www.mpi-forum.org/docs (last accessed 2020-07-14).

high-level functionality, e.g., Apache Flink [5], Apache Hadoop (based on MapReduce [54]), Apache Spark [210], and Thrill [29 SPP].

#### **2.2 Building Blocks**

**Sorting.** Sorting is a fundamental and well-studied topic in computer science, and the many results fill entire volumes [129,146] of related work. Hence, we will only review recent results for sorting integers in this section, which can be used in various of the following text indexing algorithms. In applications, sorting is most often still performed using classic sequential algorithms [107,159], despite the existence of more cache- or instruction-efficient variants [12,65,180,205] and well-developed modern parallel algorithms for shared-memory machines such as IPS4o [17], or the sorters in the MCSTL [188], Intel's TBB [174], the PBBS [186], ParlayLib [36], or Microsoft's PPL. Another method of accelerating sorting is to vectorize comparisons or operations using SIMD instructions [35,41,87,108,110,207,209].

For sorting integers, there is also the option of using radix sort algorithms, which have to be implemented carefully for modern CPUs [123,152,173]. Many parallel radix sorters for shared-memory machines are also available [138,165,192,203], and radix sort is the most prominent approach on GPUs [101,109,154,181,194].

Sorting of data on external memory is a classic subject [4,58], and implementations are available in specialized libraries like TPIE [8] or STXXL [57].

An entirely different challenge is sorting on highly scalable distributed shared-nothing machines, where load balancing, communication, and data redistribution have to be devised carefully, as PEs do not share memory. Most distributed memory sorting algorithms are based on either Quicksort [1,13,16,133,178,196] or sample sort [13,15,37,60,96,106,193,14 SPP].

Sorting is often used as a black box for text indexing algorithms, but depending on the model, machine, or scenario, large performance gains are possible by picking a better sorting implementation.

**String Sorting.** Sorting strings is an interesting special case of sorting, especially for text indexing algorithms, and most classical sorting algorithms have been adapted to multi-component objects or multi-key data [26,33,123,152,162,189]. Early parallel algorithms were formulated in the PRAM model and are based on merging of tries [102,113]. For external memory, theoretical algorithms were proposed, distinguishing short and long strings [7], or using hashing [70]. Many well-developed cache-efficient sequential and shared-memory parallel string sorting algorithms [28,33,30 SPP] are available in the TLX C++ library2. The fastest sequential ones are engineered variants of radix sort with very little memory overhead, and the fastest shared-memory parallel one is a string-aware sample sort implementation. These implementations also support outputting the lengths of the longest common prefixes (LCPs) of lexicographically adjacent strings at next to zero extra cost.

<sup>2</sup> TLX website: https://panthema.net/tlx/ (last accessed 2020-10-18).

While in principle the shared-memory parallel algorithms could be adapted to shared-nothing distributed supercomputers, they neglect that *communication volume* is the limiting factor for the scalability of algorithms to large systems [6,39]. The first distributed string sorting algorithm we developed was a straight-forward adaptation of merge sort for use in a distributed suffix array construction algorithm [78 SPP]. This first version still considered strings as unbreakable objects.

Bingmann et al. therefore developed genuine distributed string sorting algorithms based on multi-way merge sort [34 SPP], which break up the strings into characters. The strings on each PE are first sorted locally. The PEs then collectively execute a distributed partitioning algorithm which yields *p* ranges of equal size with respect to the entire data. Each range is spread across the *p* machines in *p* fragments, and in the next step, each PE sends its misplaced *p* − 1 fragments to the corresponding target machine. Finally, each PE merges the received partition fragments. The appeal of multi-way merging for communication-efficient sorting is that the local sorting exposes common prefixes of the local input strings. The *Distributed String Merge Sort* (MS) exploits this by only communicating the length of the common prefix with the previous string followed by the remaining characters, as sketched below. Here, the LCP values also allow us to use the multiway LCP-merging technique previously developed by Bingmann et al. [30 SPP] in such a way that characters are only inspected once.
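To make the LCP compression concrete, here is a minimal sketch (not the implementation from [34 SPP]) that encodes a locally sorted string sequence as pairs of an LCP value and the remaining characters, and decodes them again on the receiving side; all names are illustrative:

```cpp
#include <string>
#include <utility>
#include <vector>

// Encode a lexicographically sorted string sequence: each string is replaced
// by the length of its longest common prefix with the previous string plus
// the remaining characters; only the remaining characters must be sent.
std::vector<std::pair<std::size_t, std::string>>
lcp_encode(const std::vector<std::string>& sorted) {
    std::vector<std::pair<std::size_t, std::string>> out;
    for (std::size_t i = 0; i < sorted.size(); ++i) {
        std::size_t lcp = 0;
        if (i > 0) {
            const std::string& prev = sorted[i - 1];
            const std::string& cur = sorted[i];
            while (lcp < prev.size() && lcp < cur.size() && prev[lcp] == cur[lcp]) ++lcp;
        }
        out.emplace_back(lcp, sorted[i].substr(lcp));
    }
    return out;
}

// Decode on the receiving side by reusing the prefix of the previous string.
std::vector<std::string>
lcp_decode(const std::vector<std::pair<std::size_t, std::string>>& enc) {
    std::vector<std::string> out;
    for (const auto& [lcp, rest] : enc) {
        std::string s = out.empty() ? rest : out.back().substr(0, lcp) + rest;
        out.push_back(std::move(s));
    }
    return out;
}
```

Only the second components need to be communicated verbatim; for inputs with long common prefixes this can shrink the communication volume considerably.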

The second algorithm, *Distributed Prefix-Doubling String Merge Sort* (PDMS), further improves communication efficiency by only communicating characters that may be needed to establish the global ordering of the data (the distinguishing prefix). The algorithm also has optimal local work for a comparison-based string sorting algorithm. The key idea is to apply the communication-efficient duplicate detection algorithm by Sanders et al. [179] to geometrically growing prefixes of each string. Once a prefix has no duplicate anymore, we know that it is sufficient to transmit only this prefix. The same idea was also used to make *any* PRAM algorithm LCP-aware [68 SPP].

An experimental evaluation of MS and PDMS (which are implemented in MPI) on up to 1280 cores shows that these algorithms are often more than five times faster than previous non-string-aware algorithms. In the future, we hope that these algorithms will find their way into general purpose distributed toolkits such as Apache Spark [210] or Thrill [29 SPP].

**Further Building Blocks.** The *prefix sum* (w.r.t. a binary associative operator ⊕) of *n* elements *A*[0],...,*A*[*n* − 1] is an array *B* of *n* elements with *B*[*i*] = *A*[0] ⊕ *A*[1] ⊕ ··· ⊕ *A*[*i*] for *i* ∈ [0,*n*). In the PRAM model, the prefix sum of *n* elements can be computed in *O*(lg*n*) depth and *O*(*n*) work [112, p. 47]. Due to their ubiquity, algorithms for prefix sums are part of frameworks used in different parallel models, e.g., distributed [29 SPP] and shared memory [188].
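For illustration, a sequential inclusive prefix sum for ⊕ = + can be written with the C++17 standard library as follows (a sketch; the parallel frameworks mentioned above provide equivalent collective operations):

```cpp
#include <numeric>
#include <vector>

// Inclusive prefix sum: B[i] = A[0] + A[1] + ... + A[i].
std::vector<long> prefix_sum(const std::vector<long>& A) {
    std::vector<long> B(A.size());
    std::inclusive_scan(A.begin(), A.end(), B.begin());
    return B;
}
```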

*Rank* and *select* data structures for a bit vector of length *n* allow us to compute the number of set (or unset) bits up to position *i* ∈ [0,*n*) (rank), and the position of the *j*-th set (or unset) bit for *j* ∈ [1,*n*] (select), respectively. They are an important ingredient of wavelet trees (see Sect. 3.2). To the best of our knowledge, the only parallel construction algorithms for rank and select data structures are described by Shun [185] and require *O* (lg*n*) depth and *O* (*n*/lg*n*) work if the *n* bits are packed into *n*/lg*n* words.


**Fig. 2.** Suffix array and longest common prefix array (see Sect. 3.1) for the text *T* = mississippi\$. Below, we also show the suffixes in lexicographical order, i.e., the suffixes represented in the suffix array. There, we also visualize the longest common prefixes of two lexicographically consecutive suffixes in green ( ). (Color figure online)

In practice, only sequential construction has been considered, e.g., [46,147,211]. However, the construction of the data structures proposed by Zhou et al. [211] heavily relies on prefix sums and could thus easily be parallelized.

We can generalize binary rank and select queries for a text *T*. Then, the function *rank*α(*T*,*i*) counts, for some character α ∈ Σ and a text position *i* ∈ [0,*n*), the number of occurrences of α in *T*[0..*i*], whereas *select*α (*T*,*k*), for some *k* > 0, finds the position of the *k*-th occurrence of α in *T*. Generalized rank/select queries can be answered efficiently using wavelet trees, which reduce them to *O* (lgσ) binary rank/select queries (see Sect. 3.2).
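As a reference for the semantics used here, the following naive linear-time scans implement generalized rank and select queries exactly as defined above; real implementations answer them via wavelet trees and *o*(*n*)-bit rank/select structures instead:

```cpp
#include <cstddef>
#include <string>

// rank_a(T, i): number of occurrences of character a in T[0..i] (naive scan).
std::size_t rank_char(const std::string& T, char a, std::size_t i) {
    std::size_t count = 0;
    for (std::size_t k = 0; k <= i && k < T.size(); ++k)
        if (T[k] == a) ++count;
    return count;
}

// select_a(T, k): position of the k-th occurrence (k >= 1) of a in T;
// returns T.size() if there are fewer than k occurrences.
std::size_t select_char(const std::string& T, char a, std::size_t k) {
    std::size_t seen = 0;
    for (std::size_t pos = 0; pos < T.size(); ++pos)
        if (T[pos] == a && ++seen == k) return pos;
    return T.size();
}
```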

# **3 Text Indices**

A text index provides additional information for a text to speed up answering different types of queries. In the following, we give an overview of different construction algorithms for text indices in the models that we describe in Sect. 2.1.

# **3.1 Scalable Suffix Array Construction**

One of the best-researched text indices is the *suffix array* (*SA*), which has been introduced by Manber and Myers [150] and independently by Gonnet et al. [95] as the PAT array. The *SA* of a text *T* of length *n* is a permutation of [0,*n*) such that *T*[*SA*[*i*],*n*) < *T*[*SA*[ *j*],*n*) for all 0 ≤ *i* < *j* < *n*, i.e., it lists all suffixes lexicographically. See Fig. 2 for an example. Suffix arrays are a space efficient replacement of *suffix trees* (*ST*) (see Sect. 3.3). To obtain the same functionality as the *ST*s, *SA*s are often accompanied by additional arrays containing further information. Since suffix array construction algorithms sort all suffixes of a text, we use the term *suffix sorting* synonymously with suffix array construction.
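As a baseline for what suffix sorting computes, the following naive sketch constructs the *SA* by comparison-based sorting of suffix start positions; it is quadratic in the worst case and only meant to pin down the definition (it assumes the text already ends with the sentinel \$):

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array construction: sort the suffix start positions by
// lexicographically comparing the corresponding suffixes.
std::vector<int> naive_suffix_array(const std::string& T) {
    std::vector<int> sa(T.size());
    std::iota(sa.begin(), sa.end(), 0);              // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&](int i, int j) {
        return T.compare(i, std::string::npos, T, j, std::string::npos) < 0;
    });
    return sa;
}
```

For *T* = mississippi\$ this yields exactly the suffix array shown in Fig. 2.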

When both the text and the *SA* fit into memory, the *SA* can be computed in linear time using the *difference cover* algorithm [124]. The idea is to sample suffixes and sort the samples. Using the sorted samples, we can lexicographically compare two suffixes in constant time. First, we compute *SA*12 containing all suffixes starting at positions that are not a multiple of three, i.e., suffixes starting at positions that are congruent to 1 and 2 modulo 3. To this end, we interpret three characters as one (increasing the alphabet size) and recursively call this algorithm until all characters are unique. Then, the *SA*0 of all other suffixes is computed using the already computed *SA*12. To obtain the final *SA*, *SA*0 and *SA*12 are merged. The algorithm described above is called *DC3*. It can be generalized to other difference covers modulo *X* > 3; then we refer to it as *DCX*. The DCX algorithm can easily be adapted to several models of computation, where it also is asymptotically optimal [124]. However, it is often impractical due to substantial constant factor overheads, while *induced sorting* algorithms (Sect. 3.1) are superior, at least in sequential computation. But the latter are hard to parallelize. Closing this gap between theory and practice is an interesting open problem for algorithm engineering. Note that all but one [20] of the sequential linear-time suffix sorting algorithms rely on recursion. The *SA* can be constructed sequentially with only constant space overhead while retaining a linear running time [97,141]. For more information on sequential suffix sorting, we point to two extensive surveys [28,172] and a practical evaluation [19 SPP].

We now give an overview of suffix sorting algorithms in external memory, in shared memory (briefly touching also GPUs), and in distributed memory. Later, we take a look at the LCP array, one of the arrays often supplementing the *SA*.

**External Memory.** Crauser and Ferragina [52] and Dementiev et al. [56] present EM prefix doubling algorithms with discarding. The idea of *prefix doubling* [150] is to sort all suffixes based on the *h-order* ≤_h, defined by *T*[*i*,*n*) ≤_h *T*[*j*,*n*) ⇔ *T*[*i*,*i*+*h*) ≤ *T*[*j*, *j*+*h*) (=_h and <_h are defined analogously). The *h-rank* of a suffix is the number of suffixes that are strictly smaller w.r.t. the *h*-order. Now, during the *k*-th iteration, we compute the 2^*k*-ranks using the 2^(*k*−1)-ranks: for all suffixes *T*[*i*,*n*), we use the ranks of *T*[*i*,*i*+2^(*k*−1)) and *T*[*i*+2^(*k*−1),*i*+2^*k*), which are known from the previous iteration. We stop when ranks are unique; then, each rank is the position of that suffix in the *SA*. In practice, we can *discard* those *h*-ranks that are unique and not needed to compute other ranks any more, which can speed up the sorting, as it reduces the number of elements that we have to sort. For texts with small alphabets, prefix doubling algorithms are in practice often sped up by *alphabet reduction* in combination with *word packing*, e.g., [56,81,32 SPP,78 SPP]. Here, an alphabet of size σ is first mapped to [0,σ′) with σ′ ≤ σ such that each character of the new alphabet occurs at least once in the text and the characters retain their original order. Then, each character is augmented such that it not only stores *T*[*i*], but also the following *b*/lgσ′ characters for some suitable bit-width *b*. This makes sense, for example, when there are unused bits already reserved in the binary representation of the characters, as with DNA (σ = 4) stored in bytes (*b* = 4). This allows prefix doubling algorithms to skip the first lg(*b*/lgσ′) iterations. Dementiev et al. [56] also generalize prefix doubling to α-tupling, i.e., considering α^*k*-ranks during the (*k*+1)-th iteration, and present experimental results for their implementations. Here, EM DC3 is superior to all prefix doubling/quadrupling (α = 2 and α = 4) algorithms w.r.t. running time and I/Os. They also show that for small alphabets, DCX can yield further improvements when using difference covers of size 31.
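The following in-memory sketch shows the core of prefix doubling with comparison-based sorting (no discarding and none of the EM or word-packing optimizations discussed above); it is a textbook formulation rather than any of the cited implementations:

```cpp
#include <algorithm>
#include <numeric>
#include <string>
#include <vector>

// Prefix doubling (sketch): in round k, suffixes are sorted by their 2^k-order
// using the ranks of the previous round; O(n log^2 n) time with std::sort.
std::vector<int> prefix_doubling(const std::string& T) {
    const int n = static_cast<int>(T.size());
    std::vector<int> sa(n), rank(n), tmp(n);
    std::iota(sa.begin(), sa.end(), 0);
    if (n < 2) return sa;
    for (int i = 0; i < n; ++i)
        rank[i] = static_cast<unsigned char>(T[i]);   // 1-order given by characters
    for (int h = 1;; h *= 2) {
        // Compare suffixes i and j by the pair (rank of first h chars, rank of next h chars).
        auto cmp = [&](int i, int j) {
            if (rank[i] != rank[j]) return rank[i] < rank[j];
            int ri = i + h < n ? rank[i + h] : -1;
            int rj = j + h < n ? rank[j + h] : -1;
            return ri < rj;
        };
        std::sort(sa.begin(), sa.end(), cmp);
        // Recompute ranks: suffixes that compare equal keep the same 2h-rank.
        tmp[sa[0]] = 0;
        for (int i = 1; i < n; ++i)
            tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
        rank = tmp;
        if (rank[sa[n - 1]] == n - 1) break;           // all ranks unique: done
    }
    return sa;
}
```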

*Induced sorting* (see [144] for a detailed overview) is another prominent approach for EM suffix sorting. It is also used in the fastest sequential main memory suffix sorting algorithms [19 SPP] that are called SAIS [164] and DivSufSort3. This technique has also been generalized to compute the *SA* of collections of strings [145]. The general idea of all EM induced sorting suffix sorting algorithms is to: (1) classify all suffixes into two classes, which can be done in a single scan of the text, (2) sort at most *n*/2 special suffixes, which are suffixes from one of the classes that are (in text order) next to a suffix from the other class, and (3) induce the lexicographical order of all other suffixes using an EM priority queue. The two most prominently used classification schemes are by Itoh and Tanaka [111] and Nong et al. [164]. All following external memory algorithms make use of the latter classification scheme.

Bingmann et al. [31] propose *eSAIS* following the ideas described above. Additionally, eSAIS can also be used to compute the *LCP* array, which we define later in this section. Another EM induced sorting algorithm *DSAIS* is presented by Nong et al. [163]. However, this algorithm assumes that *n* = *O*(*M*²/*B*), which limits the scalability, as the input size is still bounded by the size of the main memory (it is also not faster in practice than eSAIS [122]). An improved version *DSAIS+* by Wu et al. [206] is reported to be faster than eSAIS and also requires around half the disk space. Another EM induced sorting algorithm, called *fSAIS*, is presented by Kärkkäinen et al. [122]. The fSAIS algorithm introduces multiple improvements compared with eSAIS and DSAIS. First, it uses the classification by Nong et al. [164] but switches the classes when it comes to determining the special class, which resolves some corner cases, because now the last suffix *T*[*n* − 1..*n*) cannot be in the special class. Then, a stable priority queue is used, making timestamps to keep track of the order of the induced suffixes unnecessary (compared to eSAIS) and thus reducing the I/O volume. Finally, to avoid random access on the text, a simplified *blockwise preinducing* [163] is used, i.e., the text is split into fixed-size blocks and the characters in each block are ordered in the same way they are accessed during the inducing phase. In addition to fewer random accesses, this makes it unnecessary to store the text positions from which the suffixes are induced. All these improvements halve the I/O volume of the algorithm compared to eSAIS. Han et al. [103] recently presented *nSAIS*, which reduces the I/O volume and required disk space even further.

Another idea for EM suffix sorting is to split the text into consecutive blocks such that the *SA* of the block can be computed in main memory. These *partial SA*s (plus additional information that helps later on) are then merged to obtain the final *SA* [117]. This approach can be parallelized [121] in EM.

**Shared Memory and GPGPU.** On a PRAM, we are only aware of induced sorting algorithms. Labeit et al. [132] present a parallel implementation of DivSufSort. Lao et

<sup>3</sup> Original implementation without publication: https://github.com/y-256/libdivsufsort (last accessed 2020-10-18). Fischer and Kurpicz give a detailed description of the algorithm and extend it to also compute the *LCP* array [77 SPP].

al. present a parallel version of SAIS [136] and SACAK [137], the latter being a simplified version of SAIS. Both are faster on repetitive texts than the parallel DivSufSort. An improved parallel SACAK algorithm, by Xie et al. [208], is the fastest algorithm on most inputs (in their evaluation, the parallel DivSufSort is only faster on two of the non-repetitive inputs).

Finally, we also want to mention *SA* construction using graphics cards (general purpose computation on graphics processing units, GPGPU). Due to the limited amount of memory available on graphics cards, these algorithms do not scale well. The dominant technique on GPUs is prefix doubling: either heavily relying on prefix sums [195] or using radix sort [169,202]. DCX algorithms have been presented by Deo and Keely [59] and Wang et al. [202] but are outperformed in practice by the prefix doubling approaches. The latter also present a DCX-prefix-doubling hybrid, which is the fastest GPGPU suffix sorting algorithm.

**Distributed Memory.** In distributed memory, suffix sorting becomes harder than in RAM, as we have to communicate to obtain access to text that is not locally available on a PE; we want to avoid random access on data that is not local. There exist distributed suffix sorting algorithms that are based on merge-sort [128], quicksort [161], and radix sort [2,88]. The DCX algorithm has also been practically evaluated in distributed memory [131,155,32 SPP] 4.

In practice, variants of prefix doubling are most often used, with different implementations of how the new ranks are computed. Kitajima and Navarro [127] presented an early distributed version of Manber and Myers's [150] prefix doubling algorithm, but it requires a lot of bookkeeping. Flick and Aluru's distributed prefix doubling algorithm [81] makes use of the inverse *SA* that is partly computed based on the currently considered *h*-ranks. A further practical improvement is that the algorithm switches to a different strategy for refining the ranks for small groups of suffixes with the same rank; this reduces communication even further. In addition, this algorithm is the only distributed algorithm that supports the computation of the *LCP* array. Two distributed prefix doubling algorithms are presented by Bingmann et al. [32 SPP]. Those algorithms have been implemented in the Thrill framework [29 SPP], which results in some restrictions regarding the access to the distributed data. The first algorithm makes use of a *window* of size 2^*k* (in the *k*-th iteration) to obtain the required rank, whereas the second one is a prefix doubling with *discarding* algorithm. This idea was later revisited and implemented using MPI [78 SPP]. Here, the prefix doubling algorithm and distributed string sorting (see Sect. 2.2) are used as building blocks for a distributed *induced sorting* suffix sorting algorithm, which is the most memory-efficient distributed suffix sorting algorithm currently available, but only works efficiently for small alphabets due to a σ²-factor in space and the number of synchronization steps.

**Longest Common Prefix Array.** The *SA* is often accompanied by different arrays containing useful information to speed up different types of queries. One of the most

<sup>4</sup> DC3/7/13 implementation without publication is available at https://github.com/bingmann/pDCX (last accessed 2020-09-25).

important ones is the *longest common prefix* (*LCP*) array. It contains the lengths of the longest common prefixes of lexicographically consecutive suffixes. More formally, *LCP*[0] = 0 and *LCP*[*i*] = max{ℓ ≥ 0 : *T*[*SA*[*i*],*SA*[*i*]+ℓ) = *T*[*SA*[*i*−1],*SA*[*i*−1]+ℓ)}, see Fig. 2. The *LCP* array can be computed sequentially in linear time [125].
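A common way to obtain the *LCP* array sequentially is the linear-time algorithm of Kasai et al. [125]; the following is a compact sketch of that approach, assuming 0-based indexing and a text terminated by the sentinel \$:

```cpp
#include <string>
#include <vector>

// Kasai-style linear-time LCP array construction from text and suffix array:
// LCP[0] = 0 and LCP[i] = lcp(T[SA[i]..], T[SA[i-1]..]) for i > 0.
std::vector<int> lcp_from_sa(const std::string& T, const std::vector<int>& sa) {
    const int n = static_cast<int>(T.size());
    std::vector<int> rank(n), lcp(n, 0);
    for (int i = 0; i < n; ++i) rank[sa[i]] = i;     // inverse suffix array
    int h = 0;
    for (int i = 0; i < n; ++i) {                    // process suffixes in text order
        if (rank[i] > 0) {
            int j = sa[rank[i] - 1];                 // lexicographic predecessor
            while (i + h < n && j + h < n && T[i + h] == T[j + h]) ++h;
            lcp[rank[i]] = h;
            if (h > 0) --h;                          // the lcp drops by at most one
        } else {
            h = 0;
        }
    }
    return lcp;
}
```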

There exist *LCP* array construction algorithms based on prefix doubling in distributed memory [81]. In external memory, the *LCP* array can be constructed while executing eSAIS [31]. Alternatively, it can be computed after the computation of the *SA* [115,116]. This EM computation can also be parallelized [118,119]. On GPUs, there exists a parallel version of Kasai et al.'s [125] algorithm [59]. We refer to [183] for an extensive evaluation of different shared memory *LCP* array construction algorithms. The *LCP* array construction has also been generalized to collections of strings [67,145].

#### **3.2 Compressed Full-Text Index**

In the following, we consider a space-efficient alternative to the *SA*, the FM-index. We first look at the construction of its two main building blocks, the Burrows-Wheeler transform and the wavelet tree, and then how it can be combined to finally obtain the FM-index.

**Burrows-Wheeler Transform.** The *Burrows-Wheeler transform* (*BW T*) [42] of a text *T* of length *n* is defined by *BW T*[*i*] = *T*[(*SA*[*i*]−1) mod *n*]. A different, more literal definition of the *BW T* is that we sort the strings *S*0 = *T*[0]...*T*[*n* − 1], *S*1 = *T*[1]...*T*[*n* − 1]*T*[0], ..., *S*(*n*−1) = *T*[*n* − 1]*T*[0]...*T*[*n* − 2] (the *shifts* of *T*) lexicographically. Then the *BW T* consists of the last characters of the shifts, read in lexicographic order of the shifts. We call this the *naive* approach. See Fig. 3a for an example of the *BW T*. The first definition of the *BW T* can be translated to a simple construction algorithm based on the *SA*—for which we have seen many construction algorithms in different models of computation in Sect. 3.1. However, there are many algorithms that do not require the computation of the *SA*. In RAM, the best main memory algorithm can compute the *BW T* in time *O*(*n*lgσ/√lg*n*) for alphabets of size σ ≤ √lg*n* [126].
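Following the first definition, computing the *BW T* from an existing *SA* is a single linear scan; a minimal sketch:

```cpp
#include <string>
#include <vector>

// Burrows-Wheeler transform via the suffix array: BWT[i] = T[(SA[i] - 1) mod n].
std::string bwt_from_sa(const std::string& T, const std::vector<int>& sa) {
    const int n = static_cast<int>(T.size());
    std::string bwt(n, '\0');
    for (int i = 0; i < n; ++i)
        bwt[i] = T[(sa[i] + n - 1) % n];
    return bwt;
}
```

For *T* = mississippi\$ and the *SA* from Fig. 2 this produces the *BW T* shown in Fig. 3a.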

On a PRAM, Hayashi and Taura [104] present a construction algorithm that is based on the divide-and-conquer paradigm. They first recursively split the text into consecutive slices (until the size of a slice falls below a threshold). After that, *partial BWT*s are computed for the slices. These partial *BWT*s are then merged in parallel. To speed up merging, additional information, like *SA* samples and *WT*s, is used. Liu et al. [143] present an algorithm that does not merge the partial *BWT*s directly, but only computes partial *SA*s and merges those. However, unlike Hayashi and Taura, they use a single dedicated PE to merge the partial *SA*s, which are computed by all other PEs. Again, additional information, like the *LCP* array (see Sect. 3.1), is used. The *BW T* is then obtained using the final *SA*. Fuentes-Sepúlveda et al. [86] present a parallel version of [157] that considers consecutive slices of size Δ = lgσ*n* as meta-symbols. The *SA* of the concatenation of *S*<sup>1</sup> and *S*<sup>2</sup> (of size 2*n*/Δ) is used to compute a partial *BW T*. Then, all other shifts *Si* (Δ −2 many) are merged (each in parallel) with *S*1. Additional information obtained by the merging is used to update the partial *BW T*.


**Fig. 3.** Burrows-Wheeler transform of the text *T* = mississippi\$ in (a). In (b), we show the wavelet tree (assuming σ = 8) of the Burrows-Wheeler transform depicted in (a). The binary representation of the characters is given in (c). Together, the Burrows-Wheeler transform and the wavelet tree are the FM-index, which we briefly describe in Sect. 3.2.

Ohlebusch et al. [167] consider the *reverse BW T* (*BW T*rev), i.e., the *BW T* of the reverse text *<sup>T</sup>*rev <sup>=</sup> *<sup>T</sup>* [*<sup>n</sup>* <sup>−</sup> <sup>1</sup>]*<sup>T</sup>* [*<sup>n</sup>* <sup>−</sup> <sup>2</sup>]...*<sup>T</sup>* [0] that is of interest for short read mapping (cf. Sect. 4). The sequential version of the algorithm makes use of the wavelet tree of the *BW T* of the text, the *SA*, and the text itself. This leads to independent intervals in *BW T*rev that can easily be computed in parallel. Gilchrist and Cuhadar [93] show that for many applications (cf. Sect. 4.2), the *BW T* is only required for slices of the text. The *BW T* construction for independent slices of the text is easy to parallelize.

Menon et al. [153] give a distributed *BW T* construction algorithm based on MapReduce. Another distributed algorithm based on merging is presented by Wang et al. [201]. This algorithm is tuned for large collections of DNA reads: it first partitions the text with respect to common prefixes and then computes the *BW T* for the partitions with a common prefix—similar to the domain decomposition for wavelet trees (cf. Sect. 3.2). Ferragina et al. [72] present an EM version of [104] that is based on merging. Also in EM, *prefix free parsing* [40,130] is used, which is a technique similar to the one used for the asymptotically best sequential *BW T* construction algorithm [126]. The naive *BW T* construction has also been parallelized with FPGAs by Trinidad et al. [198]. Here, the disadvantage is that we actually have to store all shifts *Si*. This approach is also considered on GPGPUs by Patel et al. [170].

As with *SA* and the *LCP* array, the *BW T* has also been generalized for a collection of strings and there exist external memory algorithms for its construction [67,145].

**Wavelet Trees.** For our compressed full-text index, we need to answer generalized rank and select queries (see Sect. 2.2) on the *BW T* efficiently. The *wavelet tree* (*W T*), introduced by Grossi et al. [99], is a binary tree data structure that allows answering both queries in time *O*(lgσ) and can be stored in *n*lgσ(1 + *o*(1)) bits of memory. Each node of the tree represents an interval [*a*,*b*] ⊆ Σ and is labeled by a bit vector that contains one bit for each text position *i*, in text order, where *T*[*i*] ∈ [*a*,*b*]. The bit is set iff *T*[*i*] > ⌊(*a* + *b*)/2⌋. The root node represents the entire alphabet [*a*,*b*] = Σ and is therefore labeled by *n* bits, corresponding to the entire text *T*. A node has two children if |[*a*,*b*]| ≥ 2. Then, recursively, the left child represents the interval [*a*, ⌊(*a* + *b*)/2⌋], and the right child represents [⌊(*a*+*b*)/2⌋+1,*b*]. Finally, the leaves represent intervals of size one or two. Because the alphabet is split in two halves at every node, the tree has height lgσ. Figure 3b shows an example.

Instead of comparing a character to the interval's middle to determine its bit in a node, it is more common to look at the lgσ bits of the characters' binary representations, starting with the most significant bit. Each bit tells whether to go left (zero) or right (one), i.e., characters encode a path down the *W T* starting from the root. In that regard, different codes can be used. A prominent example for using a code other than binary is the *Huffman-shaped W T*, which is constructed based on the characters' canonical Huffman codes. The bit vectors labelling the nodes then require only as much space as the Huffman-compressed text.

Apart from text indexing, the *W T* has applications in more areas, as described in various surveys on the topic [73,100,149,160]. An alternative representation of the *W T*—the wavelet *matrix* (*WM*)—introduced by Claude et al. [47], is a more efficient choice when dealing with large alphabets. It only requires negligible extra space compared to the *W T* and can be used to answer the same queries in the same asymptotic time. However, when answering queries, fewer constant-time binary rank queries are needed on the bit vectors than in the *W T*, making it faster in practice. The similarities and differences between *W T* and *WM* are studied in more detail by Dinklage [61]. The remainder of this section focuses on algorithms to construct the *W T* in the computational models introduced in Sect. 2.1. We will refer to *levels* of the *W T*, where level ℓ describes the set of nodes with depth ℓ.

We first consider *sequential* construction algorithms. There are various improvements to naïve algorithms to construct the *W T*: Claude et al. [48] and Tischler [197] give the most space-efficient algorithms using only *O*(lg*n*) bits, but do not provide a competitive implementation. Da Fonseca and da Silva [84] give an *online* construction algorithm, i.e., one where no prior knowledge of the input alphabet is required, that runs in time *O*(*n*lgσ) and uses *n*lgσ + *o*(*n*lgσ) bits of space. The fastest known algorithms in theory require time *O*(*n*lgσ/√lg*n*) and were given by Babenko et al. [18] and Munro et al. [158]. The latter was implemented by Kaneta [114], proving that the use of modern CPU instructions can reflect theoretical improvements also in practice. Kaneta's results are competitive with the currently known fastest and most space-efficient algorithm to construct the *W T*, which has been developed by Fischer et al. [79 SPP]: it is based on *prefix counting* and, except for the topmost level, constructs the *W T* bottom-up as described in the following. In a first scan of *T*, we compute the histogram of *T*, i.e., the frequencies of all characters, as well as the topmost level of the *W T*, which consists of the characters' most significant bits in the same order in which they occur in *T*. For each remaining level ℓ ∈ [2,lgσ), starting with the bottommost level, we first compute the histogram of the ℓ-bit prefixes of the characters. This is done by combining the frequencies of every pair in the previous histogram: because a node combines the two intervals of the alphabet represented by its children, the total frequency of its represented characters is the sum of the respective frequencies of its children. The histogram for level ℓ allows us to easily compute the positions of the first bit for every node on level ℓ. In one scan of *T*, we can then compute the bits for all nodes on level ℓ and directly write them to the correct positions. The algorithm requires total time *O*(*n*lgσ) and σlg*n* bits of space in addition to the input and output. The same technique can be used to construct the Huffman-shaped *W T*, where it also yields the best practical results in terms of speed and space usage.

We now regard the *parallel W T* construction in the *shared memory* model. Labeit et al. [132] gave a recursive algorithm based on the *parallel split* operation. Here, the available PEs process *T* in parallel to compute the bits for the root node. These bits are then used to perform a parallel split of *T* for the left and right child, which are recursively processed in parallel. The number of PEs used to process each child is proportional to the sizes of the children. Two further techniques for parallel *W T* construction stand out: *domain decomposition* and an algorithm based on *sorting*. The use of domain decomposition for *W T* construction has first been proposed by Sepúlveda et al. [85]. The input *T* is partitioned such that every PE receives a slice of size *n*/*p* and computes the entire *W T* for its slice using any sequential algorithm, e.g., prefix counting. In a subsequent step, these *W T*s are merged into the *W T* for *T*, which can be done efficiently by concatenating the bit vectors contained in the corresponding nodes. Because an arbitrary sequential construction algorithm can be used locally, domain decomposition can be tuned to have a very low memory footprint. The algorithm based on sorting, first proposed by Shun [184], constructs the *W T* top-down, level by level, and makes use of stable integer sorters, which are well studied for all practically relevant computational models. The bits of the topmost level can be computed in an initial parallel scan of *T*, similar to the (sequential) prefix counting algorithm. Then, before proceeding to some level ℓ > 1, the text is reordered by stably sorting the characters according to their ℓ-bit prefixes, which puts them in the correct positions to compute that level's bit vector in a parallel scan of the reordered text. To that end, the algorithm only requires lgσ parallel scans of *T*. For both algorithms, Shun presents techniques that allow for different trade-offs between work and time [185]. The best known implementations were given by Fischer et al. [79 SPP], concluding that domain decomposition is the fastest approach in practice, also for constructing the Huffman-shaped *W T*.
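A sequential sketch of this sorting-based, level-by-level construction looks as follows; it stores one plain bit vector per level, uses std::stable_sort in place of a parallel stable integer sorter, and assumes the characters come from [0, 2^levels). The function name and representation are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Level-wise wavelet tree bits: on level l, the characters are grouped by their
// l most significant bits (stable, i.e., in text order within each group), and
// the bit stored for each character is its (l+1)-th most significant bit.
std::vector<std::vector<bool>> wt_levels(const std::string& text, int levels) {
    std::vector<std::uint8_t> cur(text.begin(), text.end());
    std::vector<std::vector<bool>> bits(levels);
    for (int l = 0; l < levels; ++l) {
        const int shift = levels - 1 - l;             // position of this level's bit
        bits[l].reserve(cur.size());
        for (std::uint8_t c : cur) bits[l].push_back((c >> shift) & 1);
        // Stably reorder by the (l+1)-bit prefixes for the next level.
        std::stable_sort(cur.begin(), cur.end(),
                         [&](std::uint8_t a, std::uint8_t b) {
                             return (a >> shift) < (b >> shift);
                         });
    }
    return bits;
}
```

The domain decomposition variant instead runs a sequential construction like this on *p* slices of *T* and concatenates the per-node bit vectors afterwards.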

The parallel construction in *distributed memory* has been studied by Dinklage et al. [64 SPP], confirming the practical relevance of domain decomposition, which yields the fastest running times and best memory efficiency in practice. An important measure for distributed memory algorithms is the communication volume. During the distributed domain decomposition, only the merging phase requires communication between the PEs. They also adapted Shun's parallel sorting algorithm [184] to distributed memory and achieved nearly as good running times, albeit requiring more communication. Because the individually constructed levels need not be partitioned into nodes, the sorting algorithm has furthermore been found to be better suited than domain decomposition for constructing the *W T* for large alphabets.

Finally, we look at *W T* construction in *external memory*. Ellert and Kurpicz [69 SPP] present sequential and parallel external memory algorithms. The sequential algorithm is based on sorting and works similar to the corresponding parallel algorithm. Using only a constant amount of main memory, it requires two scans of *T* for each level of the *W T*. They also provide various semi-external algorithms with similar properties, all of which outperform the semi-external *W T* construction algorithms from the Succinct Data Structure Library (SDSL) [94]. Finally, their parallel algorithm makes use of domain decomposition to distribute work on the available PEs, each PE using a sequential in-memory algorithm (e.g., prefix counting) to construct a partial *W T*. Because the *p* parts of *T* may not fit into main memory, each PE furthermore partitions its part into segments of size *k* such that a segment and its *W T* does fit in main memory. They then process their part segment by segment. The algorithm requires four scans over *T* for each level, plus σ random I/O operations for each segment. Naturally, because of the necessary synchronizations with external memory, the algorithm only scales well up to a limited number of PEs. Yet, the parallelization achieves a notable speedup in practice.

**FM-Index.** The FM-index [74] combines the *BW T* and (Huffman-shaped) *WT*s into a compressed full-text index. It is widely used, in particular in most DNA read aligners [134] and in bioinformatics in general (cf. Sect. 4.1).

To *locate* a pattern using the FM-index, a *backward search* is performed. Using the *C* array (for each α ∈ Σ, *C*[α] is the overall number of occurrences of characters in *BW T* that are strictly smaller than α, i.e., the rank of α in Σ) and the *W T* of the *BW T* to answer *rank*α(*i*) (on the *BW T*, cf. Fig. 3) it is possible to search backwards for a pattern in *T* [74]: Given an ω-interval [*i*, *j*] (i.e., ω is a prefix of *T*[*SA*[*k*]..*n*) if and only if *i* ≤ *k* ≤ *j*) and α ∈ Σ, the procedure called *backwardSearch*(α,[*i*, *j*]) returns the αω-interval [*lb*,*rb*], where *lb* = *C*[α] + *rank*α [*i*] + 1 and *rb* = *C*[α] + *rank*α [ *j* + 1]. If *lb* > *rb*, the pattern does not occur in *T*.
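Below is a minimal sketch of backward search, using a naive rank over the *BW T* instead of a wavelet tree and a half-open interval convention (so the index arithmetic differs slightly from the formulas above); class and function names are illustrative:

```cpp
#include <array>
#include <string>
#include <utility>

// Backward search on the BWT (sketch). The omega-interval is half-open:
// [lb, rb) contains the suffix array positions whose suffixes start with the
// pattern suffix processed so far.
struct FMIndexSketch {
    std::string bwt;
    std::array<long, 256> C{};                        // C[a]: #characters < a in bwt

    explicit FMIndexSketch(std::string b) : bwt(std::move(b)) {
        std::array<long, 256> cnt{};
        for (unsigned char c : bwt) ++cnt[c];
        long sum = 0;
        for (int a = 0; a < 256; ++a) { C[a] = sum; sum += cnt[a]; }
    }

    // occ(a, i): occurrences of a in bwt[0..i) (naive scan; a wavelet tree
    // answers this in O(lg sigma) time).
    long occ(unsigned char a, long i) const {
        long cnt = 0;
        for (long k = 0; k < i; ++k) cnt += (static_cast<unsigned char>(bwt[k]) == a);
        return cnt;
    }

    // Count the occurrences of pattern P in T by processing P back to front.
    long count(const std::string& P) const {
        long lb = 0, rb = static_cast<long>(bwt.size());
        for (auto it = P.rbegin(); it != P.rend() && lb < rb; ++it) {
            unsigned char a = static_cast<unsigned char>(*it);
            lb = C[a] + occ(a, lb);
            rb = C[a] + occ(a, rb);
        }
        return rb - lb;                               // 0 if the pattern does not occur
    }
};
```

Each step processes one pattern character and needs two generalized rank queries, which an actual FM-index answers in *O*(lgσ) time via the wavelet tree of the *BW T*.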

Note that any combination of *BW T* and *W T* construction algorithms can be used to compute the FM-index (in any model of computation). Still, there exist dedicated practical PRAM FM-index construction algorithms by Labeit et al. [132] and Liu et al. [143]. The former combines their parallel *SA* (see Sect. 3.1) and *W T* (see Sect. 3.2) construction algorithms to compute an FM-index (in parallel), whereas the latter provide a parallel algorithm that computes both the *BW T* and the FM-index.

## **3.3 Suffix Trees**

A suffix tree (*ST*) for a string of length *n* is a compact trie storing all the suffixes of *T*, i.e., the concatenation of the edge labels on the path from the root to leaf *i* exactly spells out the suffix *T*[*i*..*n*); see Fig. 4a for an example. Weiner [204] showed that it can be constructed in linear time provided that the underlying alphabet has constant size. Farach-Colton et al. [71] gave the first suffix tree construction algorithm that is optimal for all alphabets. It has linear run-time for alphabets consisting of integers in a polynomial range. The *ST* is one of the most powerful data structures in string processing, with applications in fields like bioinformatics or information retrieval, e.g., [21,45].

Abouelhoda et al. [3] showed that there is a one-to-one correspondence between the set of all lcp-intervals and the set of all internal nodes of the *ST* of *T*. Let us define the concept of lcp-intervals (see Fig. 4a). An interval [*i*, *j*] in the *LCP* array—for simplicity, we now assume that *LCP*[0] = −1 = *LCP*[*n*]—is called an *lcp-interval of lcp-value* ℓ if (1) *LCP*[*i*] < ℓ, (2) *LCP*[*k*] ≥ ℓ for all *k* with *i* < *k* ≤ *j*, (3) *LCP*[*k*] = ℓ for at least one *k* with *i* < *k* ≤ *j*, and (4) *LCP*[*j* +1] < ℓ. Every index *k* (*i* < *k* ≤ *j*) with *LCP*[*k*] = ℓ is called ℓ*-index* or *lcp-index*. A leaf in the *ST* corresponds to a *singleton interval* [*k*,*k*]. The parent interval of an lcp-interval [*i*, *j*] (or a singleton interval) is the smallest lcp-interval that contains [*i*, *j*] but does not coincide with [*i*, *j*].

The drawback of *ST*s is their huge space consumption: even carefully engineered implementations require 8–20 bytes per input character. It is possible to save a lot of space by representing the *ST* topology by a sequence of balanced parentheses. The sequence BPS, for instance, can be constructed by a depth first search traversal of the (uncompressed) *ST* as follows. At each node *v* (starting at the root), write an opening parenthesis, recursively process the child nodes of *v*, and write a closing parenthesis afterwards (see Fig. 4b). Since the *ST* has *n* leaves and up to *n* − 1 internal nodes, the BPS needs up to 4*n*−2 bits. Based on the BPS, all navigational operations on the *ST* can be supported with data structures that require only *o*(*n*) bits [177].

The BPS can be constructed in parallel on a shared memory architecture in the CRCW PRAM model with the help of the *LCP* array as follows; see [22] for details, where also the necessary adjustments for the CREW model are explained. Create two arrays *Co* and *Cc* of size *n*, enumerate all lcp-intervals in parallel, and increment *Co*[*i*] and *Cc*[*j*] for each lcp-interval [*i*, *j*]. After that, compute the prefix sum *PS* of *sum*[*i*+1] = *Co*[*i*] + *Cc*[*i*], and write *Co*[*i*] opening followed by *Cc*[*i*] closing parentheses at position *PS*[*i*] into the bit vector BPS (in parallel). It is possible to enumerate all lcp-intervals (in parallel) with the help of the arrays *PSV* (previous smaller value) and *NSV* (next smaller value), which are defined as follows:

$$\begin{aligned} PSV[i] &= \max\{ j \mid 0 \le j < i \text{ and } LCP\left[j\right] < LCP\left[i\right] \}, \\ NSV[i] &= \min\{ j \mid i < j \le n \text{ and } LCP\left[j\right] < LCP\left[i\right] \} \end{aligned}$$

Table 1 shows an example (an entry ⊥ means that the value is undefined). The key observation is that for any index *i* with 0 < *i* < *n* and *LCP*[*i*] = ℓ, the interval [*PSV*[*i*],*NSV*[*i*] − 1] is an lcp-interval of lcp-value ℓ and *i* is one of its lcp-indices; for a proof see, e.g., [166, Lemma 4.3.8]. A problem of this approach is that an lcp-interval with multiple ℓ-indices will occur more than once in the enumeration. To overcome this problem, such an interval is reported if and only if *i* is the first (leftmost) ℓ-index of the interval. To this end, previous smaller values (*PSV*) are replaced with previous smaller or equal values (*PSEV*), where the array *PSEV* is defined by *PSEV*[*i*] = max{ *j* | 0 ≤ *j* < *i* and *LCP*[*j*] ≤ *LCP*[*i*]}. Then [*PSV*[*i*],*NSV*[*i*]−1] appears in the enumeration if and only if *LCP*[*i*] ≠ *LCP*[*PSEV*[*i*]].

The problem of computing previous smaller and next smaller values, also known as the all-nearest-smaller-values problem (ANSV), was already solved by Berkman et al. [27] with *O*(*n*) work and in *O*(log log *n*) time using *O*(*n*/log log *n*) processors on a CRCW PRAM.
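The following self-contained C++ sketch illustrates these definitions sequentially; it is *not* the parallel ANSV algorithm of Berkman et al., but computes *PSV*, *NSV*, and *PSEV* by standard stack scans and then reports every lcp-interval exactly once at its first ℓ-index. The hard-coded *LCP* values are those of mississippi\$ under the convention *LCP*[0] = *LCP*[*n*] = −1, computed for this example (cf. Fig. 2).

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // LCP array of T = mississippi$ with sentinels LCP[0] = LCP[n] = -1.
    std::vector<int> lcp = {-1, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3, -1};
    const std::size_t n = lcp.size() - 1;                // entries 0..n

    std::vector<std::size_t> psv(n + 1, 0), nsv(n + 1, n), psev(n + 1, 0);
    std::vector<std::size_t> st;                         // stack of candidate indices

    // PSV[i] = max{ j < i : LCP[j] < LCP[i] }, left-to-right scan.
    st = {0};
    for (std::size_t i = 1; i < n; ++i) {
        while (lcp[st.back()] >= lcp[i]) st.pop_back();
        psv[i] = st.back();
        st.push_back(i);
    }
    // PSEV[i] = max{ j < i : LCP[j] <= LCP[i] }, same scan with a non-strict comparison.
    st = {0};
    for (std::size_t i = 1; i < n; ++i) {
        while (lcp[st.back()] > lcp[i]) st.pop_back();
        psev[i] = st.back();
        st.push_back(i);
    }
    // NSV[i] = min{ j > i : LCP[j] < LCP[i] }, right-to-left scan.
    st = {n};
    for (std::size_t i = n - 1; i >= 1; --i) {
        while (lcp[st.back()] >= lcp[i]) st.pop_back();
        nsv[i] = st.back();
        st.push_back(i);
    }

    // Report each lcp-interval [PSV[i], NSV[i]-1] only at its first lcp-index,
    // i.e., when LCP[i] != LCP[PSEV[i]].
    for (std::size_t i = 1; i < n; ++i)
        if (lcp[i] != lcp[psev[i]])
            std::cout << lcp[i] << "-[" << psv[i] << "," << nsv[i] - 1 << "]\n";
}
```

On this input, the sketch prints the seven internal nodes of the suffix tree of mississippi\$ in the ℓ-[*i*, *j*] notation of Fig. 4a, each exactly once.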

**Fig. 4.** In (a), the suffix tree for *T* = mississippi\$ (an annotation ℓ-[*i*, *j*] within a node shows the corresponding lcp-interval, i.e., ℓ is the string-depth of the node and [*i*, *j*] is the corresponding interval). The numbers below the leaves are the starting positions of the corresponding suffixes. For the corresponding *SA* and *LCP* array see Fig. 2. The BPS of the suffix tree is shown in (b). Matching parentheses are connected by dotted lines. A leaf in the suffix tree is represented by an opening parenthesis that is immediately followed by a closing parenthesis; its leaf number (leaf *i* represents suffix *T*[*i*..*n*)) is depicted above the two parentheses. The last row shows the positions of the parentheses in the BPS.

In the following, we will focus on distributed memory. He and Huang [105] presented a bulk-synchronous parallel adaptation of the algorithm invented by Berkman et al. [27]. Flick and Aluru [82] improved their work in various directions by introducing a generalized version of the ANSV problem. They showed how to handle duplicate values, generalized the communication structure, and provided novel proofs. Based on these improvements, they presented a parallel *ST* construction algorithm using the suffix and *LCP* arrays that runs in *O*(*n*/*p* + *p*) time, which is work optimal for *p* = *O*(√*n*). In a first phase, they represent the *ST* as an array *E* of edges (*i*, *parent*(*i*)). This approach requires a unique representative index for each node *v* in the *ST*. Since *v* corresponds to an lcp-interval, one can choose the first (leftmost) lcp-index of that lcp-interval as its representative. Moreover, Lemma 1 shows that the representative of the parent interval can be computed with the help of "previous-furthest-equal" values, defined for all *i* with 1 < *i* < *n* as follows:

$$PFE[i] = \min\{\, j \mid PSV[k] < j < i \text{ and } LCP\left[j\right] = LCP\left[k\right] \,\}, \quad \text{where } k = PSEV[i]$$

**Lemma 1.** *Recall that, for any index i with* 1 < *i* < *n, the interval* [*lb*,*rb*]*, where lb* = *PSV*[*i*] *and rb* = *NSV*[*i*] − 1*, is an lcp-interval and i is an lcp-index of* [*lb*,*rb*]*. In the following, let m* = *PFE*[*i*]*. If LCP*[*m*] = *LCP*[*i*]*, then m is the representative lcp-index and i is a different lcp-index of* [*lb*,*rb*]*. From now on we assume LCP*[*m*] < *LCP*[*i*]*. In this case, we have LCP*[*m*] = *LCP*[*PSV*[*i*]]*. If LCP*[*PSV*[*i*]] < *LCP*[*NSV*[*i*]]*, then NSV*[*i*] *is the representative lcp-index of the parent interval of* [*lb*,*rb*]*; see [166, Lemma 4.3.9]. Otherwise, PSV*[*i*] *is an lcp-index of the parent interval of* [*lb*,*rb*] *and m is the representative lcp-index of that parent.*

Flick and Aluru's algorithm assumes that the input is distributed equally across processors with *n*/*p* elements per process. It computes *PFE* and *NSV* in *O*(*n*/*p* + *p*) time. Since the processor responsible for the range [*jn*/*p*, (*j* + 1)*n*/*p* − 1] has the corresponding portions of *LCP*, *PFE*, and *NSV* in local memory, it can compute the edges (*i*, *parent*(*i*)) in its range based on Lemma 1. The parents of leaf nodes in its range can be computed similarly; see [82] for details. In the second phase of their algorithm, Flick and Aluru show how the edges can be inverted (and analyse the communication complexity); this is necessary because, for pattern matching applications, each internal node should point to its children instead of each node storing a parent pointer. Other algorithms for distributed *ST* construction include [44,50,212].

# **3.4 Query Answering**

Up to this point, we have only considered the construction of different full-text indices. Since all full-text indices that we have looked at have their origin in the RAM model, they can easily be used there by allocating the incoming queries in a round-robin fashion to the PEs. In external or distributed memory, however, the obstacle is that neither the whole text nor the whole index can be accessed in a random-access manner, as is done in the construction algorithms. In this section, we take a look at different approaches to answering queries in such a setting.

Clifford [49] shows how to use a suffix tree in distributed memory to answer different types of queries. The suffix tree is built using Ukkonen's algorithm [199], which requires the whole text to be available at each PE and thus limits the scalability of this approach significantly.

Mäkinen et al. [148] use the *compressed* suffix array (CSA) [176] in distributed and external memory. The CSA requires roughly the same space as the *compressed* text but, unlike the *SA*, does not need the text to answer queries; it is a *self-index*. In main memory, queries of length *m* can be answered in *O*(*m* lg *n*) time. They improve query times by sampling strings of length ℓ instead of single characters and by encoding the supporting data structures using Elias delta encoding in combination with lookup tables. This allows constant-time access to the supporting data structures (but not constant-time queries). In EM, their approach can search for a pattern of length *m* in $O(m \log_B n)$ I/Os (which can be reduced to *O*((*m* lg *n*)/*B*) if *O*(*n*) bits can be stored in main memory). In distributed memory, *m* supersteps are required to answer such a query; during each superstep only a constant number of words have to be communicated and *O*(lg *n*) local work is required.

Arroyuelo et al. [10] compare different layouts of the *SA* for pattern matching in distributed memory. In the *global* layout, each PE holds a consecutive slice of the *SA*; in addition to the *SA*, pruned suffixes are stored to speed up querying at the local PE, and each PE builds a trie over the suffixes at the beginning and end of every slice in order to route a query to the PE that can answer it locally. In the *local* layout, each PE holds a consecutive slice of the text and builds an *SA* only for this local slice. Here, every PE must answer the query locally and return its result, requiring only a constant number of supersteps but significant local work (as all PEs always have to search for the query). The *multiplex* layout is an intermingled global layout, where the *i*-th entry of the global *SA* is stored at PE *i* mod *p* in consecutive fashion, i.e., the *i*-th and (*i* + *p*)-th entries are stored consecutively at the same PE; the corresponding pruned suffixes are stored as in the global layout. The multiplex layout (and in some cases the global layout) is the most efficient one in their experiments. They also propose two additional layouts that, however, do not perform as well in practice.
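For reference, the local search that each PE performs on its portion of the *SA* is, at its core, the classic *O*(*m* lg *n*) binary search sketched below; the pruned suffixes and tries described above only accelerate routing and comparisons. The text and its suffix array in the example are those of mississippi\$, computed for this illustration (cf. Fig. 2).

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Classic suffix-array pattern search: binary search for the SA interval of
// all suffixes that have the pattern as a prefix.
std::pair<std::size_t, std::size_t>
sa_range(const std::string& t, const std::vector<std::size_t>& sa, const std::string& p) {
    auto suffix_less = [&](std::size_t suf, const std::string& pat) {
        return t.compare(suf, pat.size(), pat) < 0;       // first |pat| chars of t[suf..)
    };
    auto pattern_less = [&](const std::string& pat, std::size_t suf) {
        return t.compare(suf, pat.size(), pat) > 0;
    };
    auto lo = std::lower_bound(sa.begin(), sa.end(), p, suffix_less);
    auto hi = std::upper_bound(sa.begin(), sa.end(), p, pattern_less);
    return {static_cast<std::size_t>(lo - sa.begin()),
            static_cast<std::size_t>(hi - sa.begin())};   // half-open SA interval
}

int main() {
    std::string t = "mississippi$";
    std::vector<std::size_t> sa = {11, 10, 7, 4, 1, 0, 9, 8, 6, 3, 5, 2};
    auto [lo, hi] = sa_range(t, sa, "ssi");
    for (std::size_t k = lo; k < hi; ++k)
        std::cout << "occurrence at position " << sa[k] << "\n";  // positions 5 and 2
}
```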

The global layout is extended by Fischer et al. [80 SPP]. Instead of answering the query directly on the *SA*, a Patricia trie [156] is constructed for each local slice. To this end, the *LCP* array (see Sect. 3.1) is required. Furthermore, a global trie is used to distribute the query to the corresponding PE. These two tries together allow queries to be answered with a constant number of supersteps.

Flick and Aluru [83] further improve the above two-level designs by developing the *distributed enhanced SA* (DESA). One improvement of the DESA is that it eliminates the explicitly stored tree structure of the two-level indices. Also, the DESA does not partition the text into consecutive slices of the same size (where queries may have to be answered on multiple PEs) but into more fine-grained intervals, such that each interval can be processed on a single PE. This approach currently scales best in practice. The local search is an adapted version of Fischer and Heun's [75] query algorithm for enhanced *SA*s [3]. Hence, they only need the *SA*, the *LCP* array, *range minimum queries* (RMQs, returning the position of the smallest element in a given range), and some additional information. To load-balance queries, the top-level trie is constructed dynamically (based on the input) such that each bottom-level index, corresponding to a leaf in the top-level trie, covers an interval of size *n*/(*cp*) for a constant *c*. To this end, the ANSV problem (cf. Sect. 3.3) is solved. This approach is significantly better than a static top-level lookup table and overall the best in practice.
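As an illustration of the RMQ primitive mentioned above (and not of the succinct RMQ structures used inside the DESA), the following sparse-table sketch answers range minimum queries in constant time after *O*(*n* lg *n*) preprocessing; the example queries the mississippi\$ *LCP* array used before.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Sparse-table RMQ: idx_[k][i] stores the position of the minimum of a[i..i+2^k-1].
class RMQ {
public:
    explicit RMQ(const std::vector<int>& a) : a_(a) {
        std::size_t n = a.size(), levels = 1;
        while ((std::size_t{1} << levels) <= n) ++levels;
        idx_.assign(levels, std::vector<std::size_t>(n));
        for (std::size_t i = 0; i < n; ++i) idx_[0][i] = i;
        for (std::size_t k = 1; k < levels; ++k)
            for (std::size_t i = 0; i + (std::size_t{1} << k) <= n; ++i) {
                std::size_t l = idx_[k - 1][i];
                std::size_t r = idx_[k - 1][i + (std::size_t{1} << (k - 1))];
                idx_[k][i] = (a_[l] <= a_[r]) ? l : r;    // leftmost minimum on ties
            }
    }
    // Position of the minimum of a[i..j] (inclusive, i <= j): two overlapping blocks.
    std::size_t query(std::size_t i, std::size_t j) const {
        std::size_t k = 0;
        while ((std::size_t{1} << (k + 1)) <= j - i + 1) ++k;
        std::size_t l = idx_[k][i], r = idx_[k][j + 1 - (std::size_t{1} << k)];
        return (a_[l] <= a_[r]) ? l : r;
    }
private:
    std::vector<int> a_;
    std::vector<std::vector<std::size_t>> idx_;
};

int main() {
    std::vector<int> lcp = {-1, 0, 1, 1, 4, 0, 0, 1, 0, 2, 1, 3};
    RMQ rmq(lcp);
    std::cout << rmq.query(2, 4) << '\n';   // prints 2, the leftmost minimum in LCP[2..4]
}
```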

# **4 Applications**

All previously described text indices have more applications than (exact) pattern matching. They can also be used to answer approximate queries, i.e., when allowing differences between the pattern and the matched positions. Pockrandt [171] shows how to transform those queries into exact queries. This is also used in practice in the SeqAn library [175], which contains efficient algorithms and data structures for the analysis of biological data (and strings in general). Furthermore, they can be used to compute succinct de Bruijn graphs, all pairs suffix-prefix overlaps, and maximal repeats [67]. In the following, we take a more detailed look into two fields where text indices are of great importance—*Bioinformatics* (Sect. 4.1) and *lossless compression* (Sect. 4.2).

## **4.1 Bioinformatics**

The most successful application of index structures in bioinformatics is backward search based on an FM-index [74] (e.g., in the form of the *WT* of the *BWT* of the input string [99], cf. Sect. 3.2). For information on *k*-mer-based tools, we refer to the recent survey by Marchet et al. [151].
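The core of FM-index-based counting is backward search, sketched below for the *BWT* of mississippi\$ (computed for this example). For clarity, the rank function simply scans the *BWT*; an actual FM-index answers these rank queries with the *WT* of the *BWT* in *O*(lg σ) time.

```cpp
#include <array>
#include <cstddef>
#include <iostream>
#include <string>

// Number of occurrences of c in bwt[0..end); a wavelet tree would answer this quickly.
std::size_t rank(const std::string& bwt, char c, std::size_t end) {
    std::size_t r = 0;
    for (std::size_t i = 0; i < end; ++i) r += (bwt[i] == c);
    return r;
}

// Backward search: returns the number of occurrences of p in the text underlying bwt.
// C[c] = number of characters in the text that are strictly smaller than c.
std::size_t count(const std::string& bwt, const std::array<std::size_t, 257>& C,
                  const std::string& p) {
    std::size_t lo = 0, hi = bwt.size();                  // current SA interval [lo, hi)
    for (auto it = p.rbegin(); it != p.rend() && lo < hi; ++it) {
        unsigned char c = static_cast<unsigned char>(*it);
        lo = C[c] + rank(bwt, *it, lo);                   // LF-mapping of the interval
        hi = C[c] + rank(bwt, *it, hi);
    }
    return hi - lo;
}

int main() {
    std::string bwt = "ipssm$pissii";                     // BWT of T = mississippi$
    std::array<std::size_t, 257> C{};
    for (char c : bwt) ++C[static_cast<unsigned char>(c) + 1];   // histogram, shifted by one
    for (std::size_t i = 1; i <= 256; ++i) C[i] += C[i - 1];     // exclusive prefix sums
    std::cout << count(bwt, C, "ssi") << " occurrences\n";       // prints 2
}
```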

The most important application of backward search in bioinformatics is read mapping. Ultra-high-throughput next-generation sequencing technologies (NGS) have been commercially available since 2005. In NGS, DNA is fragmented into small pieces, of which the first few bases are sequenced, yielding several millions of short "reads", each 30 to 400 base pairs ("DNA characters") long. The read mapping task is now to align these reads to a reference genome, i.e., to the known, nearly complete chromosomal DNA sequences of the organism in question (which may be up to several billion base pairs long); see [43] for an overview article.

Short read mappers like Bowtie [135] or BWA [140] must be able to deal with (sequencing) errors. Inexact matching is either based on recursive algorithms that use backtracking or on the seed-and-extend strategy (exact matches are used as seeds and the shared seeds are then extended into longer, inexact alignments). The same approach has also been successfully applied in genome assembly [187] (sequence assembly refers to aligning and merging reads in order to reconstruct the original sequence). Here, the fastest implementations only utilize a few threads [24]. Those read mappers usually do not use parallel construction algorithms, as the reference sequences are short, allowing a space-efficient sequential algorithm to compute the index in less than an hour.

Alignments of longer sequences (ranging from long read mapping to whole genome alignment) are also obtained by exact matching and the seed-and-extend method. One of the earliest tools in comparative genomics is based on suffix trees (and later on suffix arrays) [55], but there are also tools using the *BWT* [142]. The major principle of comparative genomics is that common features of two organisms will often be encoded within the DNA that is evolutionarily conserved between them. Therefore, comparative genomic approaches start with making some form of alignment of genome sequences. Then, they look for orthologous sequences (sequences that share a common ancestry) in the aligned genomes and check to what extent those sequences are conserved. Nowadays, one tries to take multiple genomes simultaneously into account; see [51] for an overview of pangenomics. When it comes to the alignment of longer sequences, scaling algorithms are used, e.g., multithreaded semi-external prefix-doubling algorithms [190] or building multiple partial indices (in parallel) and merging them [191]. Compressed suffix trees and FM-indices have been used in indexing variation graphs [92] and for graphical pangenome analysis [21]. In particular, the balanced parentheses sequence BPS from Sect. 3.3 was used for indexing variation graphs [190] (using the algorithm described in [168]). Using a dynamic FM-index, sequences can be inserted in batches, which can easily be parallelized [139].

# **4.2 Compression**

Text indices have been successfully applied to text compression, most notably to compressors based on the *BWT* (see Sect. 3.2) and on different variants of the Lempel-Ziv parsing of the text. Intuitively, this link between indexing and compression seems plausible, as in both cases one tries to 'group' similar substrings; in the former for listing occurrences, in the latter for exploiting the repetitiveness to somehow save space. We only consider compressors that operate over the *full* text (*not* restricted to small sliding windows/blocks); this is important for highly repetitive texts such as DNA collections of individuals from the same species.

**Lempel-Ziv in External Memory.** The LZ77-factorization [213] of a text *T* is defined as follows: suppose *T*[0..*i*) has already been parsed into LZ77-phrases. Then the next LZ77-phrase is the longest prefix of *T*[*i*..*n*) that has an occurrence in *T* starting strictly before *i* (but possibly ending in *T*[*i*..*n*)), or a single character if *T*[*i*] does not occur before position *i*. Given a text index on *T*, this prefix can be located by iteratively querying for *T*[*i*..*i* + 1), *T*[*i*..*i* + 2), ..., as long as an occurrence starting before *i* exists. In main memory, Fischer et al. [76] have the most space-efficient implementation of this idea using compressed variants of the suffix tree, needing only (1 + ε)*n* log *n* + *O*(*n*) bits of space and running in *O*(*n*/ε) time. The difficulty in EM is, of course, that such repeated querying causes too many I/Os. Kärkkäinen et al. [120] avoid this in two ways: their EM-LPF algorithm first computes the array of *longest previous factors* in EM, from which the LZ77 factorization is easily obtained, in a total of sort(*n*) I/Os. Their second algorithm, EM-LZScan, divides *T* into blocks of size Θ(*M*) and then computes the *matching statistics* [166, Sect. 5.5.4] of the current block w.r.t. the prefix of *T* up to the current block. EM-LZScan needs $O\!\left(\frac{n^2 \log\sigma}{BM\log n}\right)$ I/Os in theory, but is significantly faster than EM-LPF in practice for highly repetitive texts. A different approach was taken by Dinklage et al. [62 SPP], who show that the flexibility of allowing factor occurrences also to lie to the *right* of their starting position (so-called *bidirectional parsings*) leads to a much better throughput than EM-LZScan, while achieving similar compression rates. Their algorithm plcpcomp has been successfully applied to texts of size 128 GiB on a machine with just 16 GiB of RAM. Considering *de*compression, Belazzougui et al. [25] show the I/O-complexity to be $\operatorname{sort}(n/\log_\sigma n)$ I/Os and also give a practical implementation; however, this algorithm cannot be applied to the bidirectional variant, which is much slower at decompression. Other variants of LZ exist, but have so far not been successfully applied to large datasets, although promising approaches exist for LZ78 that might lead to semi-external solutions [11].
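The greedy definition of the LZ77 factorization can be illustrated with the following deliberately naive, quadratic C++ sketch; practical parsers replace the brute-force search by a text index or the LPF array, as discussed above.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Each phrase is the longest prefix of T[i..n) that also occurs starting
// strictly before i (the occurrence may overlap the phrase), or a single literal.
struct Phrase { std::size_t src, len; char c; };   // (source, length) or literal c

std::vector<Phrase> lz77(const std::string& t) {
    std::vector<Phrase> phrases;
    std::size_t i = 0;
    while (i < t.size()) {
        std::size_t best_len = 0, best_src = 0;
        for (std::size_t j = 0; j < i; ++j) {              // candidate earlier start
            std::size_t l = 0;
            while (i + l < t.size() && t[j + l] == t[i + l]) ++l;  // may run past i
            if (l > best_len) { best_len = l; best_src = j; }
        }
        if (best_len == 0) phrases.push_back({0, 0, t[i++]});      // literal phrase
        else { phrases.push_back({best_src, best_len, 0}); i += best_len; }
    }
    return phrases;
}

int main() {
    for (const Phrase& p : lz77("abababbbbb"))
        if (p.len == 0) std::cout << "literal '" << p.c << "'\n";
        else std::cout << "copy " << p.len << " chars from position " << p.src << "\n";
}
```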

**Parallel Burrows-Wheeler-Based Compression.** In Sect. 3.2, we already mentioned the relevant literature for computing the *BWT* *L*. This output can be postprocessed to compute a compressed version of *T*, as characters following a similar preceding context are grouped in *L*. The postprocessing consists of computing the move-to-front numbers when processing *L* from left to right, followed by a Huffman encoding of the resulting numbers. On the PRAM, Edwards and Vishkin [66] show how to perform those latter steps in *O*(log *n*) parallel time and *O*(*n*) work, and report good speedups on FPGA hardware over popular tools such as bzip2, although only using moderately sized inputs. They also show how to decompress the resulting file within the same complexities. At their core, the algorithms are reduced to the building blocks prefix sums (cf. Sect. 2.2) and list ranking. Geared more towards practice, Patel et al. [170] have similar ideas and show GPGPU implementations; however, they use mergesort for computing the *BWT* and report this as their main bottleneck.
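The move-to-front step of this pipeline is easy to state; the sequential sketch below shows it on the *BWT* of mississippi\$ (a Huffman coder would then compress the resulting small numbers, and the cited parallel algorithms compute the same output via prefix sums and list ranking).

```cpp
#include <cstdint>
#include <iostream>
#include <numeric>
#include <string>
#include <vector>

// Move-to-front encoding of a BWT: characters that recur within a short window
// (as they do in the BWT of repetitive text) are mapped to small numbers.
std::vector<std::uint8_t> move_to_front(const std::string& bwt) {
    std::vector<std::uint8_t> alphabet(256);
    std::iota(alphabet.begin(), alphabet.end(), 0);       // alphabet in initial order
    std::vector<std::uint8_t> out;
    out.reserve(bwt.size());
    for (char ch : bwt) {
        auto c = static_cast<std::uint8_t>(ch);
        std::size_t pos = 0;
        while (alphabet[pos] != c) ++pos;                 // rank of c in the current list
        out.push_back(static_cast<std::uint8_t>(pos));
        alphabet.erase(alphabet.begin() + pos);           // move c to the front
        alphabet.insert(alphabet.begin(), c);
    }
    return out;
}

int main() {
    // The BWT of mississippi$ groups equal characters, yielding many small MTF values.
    for (std::uint8_t v : move_to_front("ipssm$pissii"))
        std::cout << static_cast<int>(v) << ' ';
    std::cout << '\n';
}
```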

We are not aware of any algorithms in external or distributed memory implementing the full *BWT* compression pipeline, even though algorithms for computing the *BWT* do exist in these models of computation (see Sect. 3.2).

# **5 Conclusion and Future Work**

Advanced text index data structures such as suffix trees, suffix arrays and wavelet trees are key to handling large data sets in a range of important applications. A combination of parallel, external, and compressed implementations can approach the requirements for handling the exploding amounts of available data.

In this short survey, we have discussed a number of techniques for building and using such data structures. Our impression is that memory hierarchies and compression by themselves are fairly well understood by now. A range of parallelization approaches are known but they suffer from a tradeoff between asymptotic scalability and efficiency. In particular, the most efficient sequential and external techniques are inherently sequential. Hence, a number of important open problems remain. These involve highly scalable techniques with good constant factors (e.g., for constructing suffix arrays and *LCP* arrays with linear work) as well as integration of parallelism, memory hierarchies, compression and applications. Another interesting research direction is to engineer recent text indices for highly repetitive data [91] for handling large texts. In principle, big data frameworks such as Thrill [29 SPP] can handle parallelization and memory hierarchies automatically but the question remains whether the involved overheads are acceptable.

# **References**

	- 15. Axtmann, M., Sanders, P.: Robust massively parallel sorting. In: ALENEX, pp. 83–97. SIAM (2017). https://doi.org/10.1137/1.9781611974768.7
	- 16. Axtmann, M., Wiebigke, A., Sanders, P.: Lightweight MPI communicators with applications to perfectly balanced quicksort. In: IPDPS, pp. 254–265. IEEE Computer Society (2018). https://doi.org/10.1109/IPDPS.2018.00035
	- 17. Axtmann, M., Witt, S., Ferizovic, D., Sanders, P.: Engineering in-place (sharedmemory) sorting algorithms. ACM Trans. Parallel Comput. **9**(1), 2:1–2:62 (2022). https://doi.org/10.1145/3505286
	- 18. Babenko, M.A., Gawrychowski, P., Kociumaka, T., Starikovskaya, T.: Wavelet trees meet suffix trees. In: SODA, pp. 572–591. SIAM (2015). https://doi.org/10.1137/1. 9781611973730.39
	- 20. Baier, U.: Linear-time suffix sorting – a new approach for suffix array construction. In: CPM, pp. 23:1–23:12. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2016). https://doi.org/10.4230/LIPIcs.CPM.2016.23
	- 21. Baier, U., Beller, T., Ohlebusch, E.: Graphical pan-genome analysis with compressed suffix trees and the Burrows-Wheeler transform. Bioinformatics **32**(4), 497–504 (2016). https://doi.org/10.1093/bioinformatics/btv603
	- 22. Baier, U., Beller, T., Ohlebusch, E.: Space-efficient parallel construction of succinct representations of suffix tree topologies. ACM J. Exp. Algorithmics **22** (2017). https:// doi.org/10.1145/3035540
	- 23. Barsky, M., Stege, U., Thomo, A.: A survey of practical algorithms for suffix tree construction in external memory. Softw. Pract. Exp. **40**(11), 965–988 (2010). https:// doi.org/10.1002/spe.960
	- 24. Bauer, M.J., Cox, A.J., Rosone, G.: Lightweight algorithms for constructing and inverting the BWT of string collections. Theor. Comput. Sci. **483**, 134–148 (2013). https:// doi.org/10.1016/j.tcs.2012.02.002
	- 25. Belazzougui, D., Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lempel-Ziv decoding in external memory. In: Goldberg, A.V., Kulikov, A.S. (eds.) SEA 2016. LNCS, vol. 9685, pp. 63–74. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-38851-9\_5
	- 26. Bentley, J.L., Sedgewick, R.: Fast algorithms for sorting and searching strings. In: SODA, pp. 360–369. ACM/SIAM (1997)
	- 27. Berkman, O., Schieber, B., Vishkin, U.: Optimal doubly logarithmic parallel algorithms based on finding all nearest smaller values. J. Algorithms **14**(3), 344–370 (1993). https://doi.org/10.1006/jagm.1993.1018
	- 28. Bingmann, T.: Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools. Ph.D. thesis, Karlsruhe Institute of Technology, Germany (2018). https://doi.org/10. 5445/IR/1000085031
	- 31. Bingmann, T., Fischer, J., Osipov, V.: Inducing suffix and LCP arrays in external memory. ACM J. Exp. Algorithmics **21**(1), 2.3:1–2.3:27 (2016). https://doi.org/10.1145/ 2975593
	- 33. Bingmann, T., Sanders, P.: Parallel string sample sort. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 169–180. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40450-4\_15
	- 35. Blacher, M., Giesen, J., Sanders, P., Wassenberg, J.: Vectorized and performanceportable quicksort. CoRR abs/2205.05982 (2022)
	- 36. Blelloch, G.E., Anderson, D., Dhulipala, L.: Parlaylib A toolkit for parallel algorithms on shared-memory multicore machines. In: SPAA, pp. 507–509. ACM (2020). https:// doi.org/10.1145/3350755.3400254
	- 65. Edelkamp, S., Weiß, A.: Blockquicksort: avoiding branch mispredictions in quicksort. ACM J. Exp. Algorithmics **24**(1), 1.4:1–1.4:22 (2019). https://doi.org/10.1145/ 3274660
	- 66. Edwards, J.A., Vishkin, U.: Parallel algorithms for burrows-wheeler compression and decompression. Theor. Comput. Sci. **525**, 10–22 (2014). https://doi.org/10.1016/j.tcs. 2013.10.009
	- 67. Egidi, L., Louza, F.A., Manzini, G., Telles, G.P.: External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol. Biol. **14**(1), 6:1–6:15 (2019). https://doi.org/10.1186/s13015-019-0140-0
	- 70. Fagerberg, R., Pagh, A., Pagh, R.: External string sorting: faster and cache-oblivious. In: Durand, B., Thomas, W. (eds.) STACS 2006. LNCS, vol. 3884, pp. 68–79. Springer, Heidelberg (2006). https://doi.org/10.1007/11672142\_4
	- 71. Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. J. ACM **47**(6), 987–1011 (2000). https://doi.org/10.1145/355541. 355547
	- 72. Ferragina, P., Gagie, T., Manzini, G.: Lightweight data indexing and compression in external memory. Algorithmica **63**(3), 707–730 (2012). https://doi.org/10.1007/ s00453-011-9535-0
	- 73. Ferragina, P., Giancarlo, R., Manzini, G.: The myriad virtues of wavelet trees. Inf. Comput. **207**(8), 849–866 (2009). https://doi.org/10.1016/j.ic.2008.12.010
	- 81. Flick, P., Aluru, S.: Parallel distributed memory construction of suffix and longest common prefix arrays. In: SC, pp. 16:1–16:10. ACM (2015). https://doi.org/10.1145/ 2807591.2807609
	- 82. Flick, P., Aluru, S.: Parallel construction of suffix trees and the all-nearest-smallervalues problem. In: IPDPS, pp. 12–21. IEEE Computer Society (2017). https://doi.org/ 10.1109/IPDPS.2017.62
	- 83. Flick, P., Aluru, S.: Distributed enhanced suffix arrays: efficient algorithms for construction and querying. In: SC, pp. 72:1–72:17. ACM (2019). https://doi.org/10.1145/ 3295500.3356211
	- 84. da Fonseca, P.G.S., da Silva, I.B.F.: Online construction of wavelet trees. In: SEA, pp. 16:1–16:14. Schloss Dagstuhl - Leibniz-Zentrum für Informatik (2017). https://doi.org/ 10.4230/LIPIcs.SEA.2017.16
	- 85. Fuentes-Sepúlveda, J., Elejalde, E., Ferres, L., Seco, D.: Parallel construction of wavelet trees on multicore architectures. Knowl. Inf. Syst. **51**(3), 1043–1066 (2017). https://doi.org/10.1007/s10115-016-1000-6
	- 86. Fuentes-Sepúlveda, J., Navarro, G., Nekrich, Y.: Parallel computation of the burrows wheeler transform in compact space. Theor. Comput. Sci. **812**, 123–136 (2020). https:// doi.org/10.1016/j.tcs.2019.09.030
	- 87. Furtak, T., Amaral, J.N., Niewiadomski, R.: Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms. In: SPAA, pp. 348–357. ACM (2007). https://doi.org/10.1145/1248377.1248436
	- 88. Futamura, N., Aluru, S., Kurtz, S.: Parallel suffix sorting. In: Electrical Engineering and Computer Science, vol. 64 (2001)
	- 89. Gabriel, E., et al.: Open MPI: goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004). https://doi.org/10. 1007/978-3-540-30218-6\_19
	- 90. Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014). https://doi.org/10.1007/ 978-3-642-54423-1\_63

